Variable importance-weighted Random Forests

Yiyi Liu, Hongyu Zhao

Quant. Biol., 2017, 5(4): 338–351. DOI: 10.1007/s40484-017-0121-6

RESEARCH ARTICLE


Abstract

Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and the increased correlation between the trees of the forest.

Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node when building trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features.

Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases.

Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize the more informative features without completely ignoring the less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package “viRandomForests” based on the original R package “randomForest”, and it can be freely downloaded from http://zhaocenter.org/software.


Keywords

Random Forests / variable importance score / classification / regression

Cite this article

Yiyi Liu, Hongyu Zhao. Variable importance-weighted Random Forests. Quant. Biol., 2017, 5(4): 338–351. DOI: 10.1007/s40484-017-0121-6


INTRODUCTION

With the rapid development of molecular technologies, huge amounts of high-throughput omics data have been generated. These data provide rich information on various biological processes at the molecular level, and insights learned from them may lead to new tools for disease diagnosis, prognosis, and treatment [1]. However, the large number of features in these data and the existence of complex interactions among these features pose great challenges in extracting useful information for accurate predictions. Random Forests [2], an ensemble method based on classification and regression trees (CART) trained on bootstrapped samples and randomly selected features, has been shown to have superior performance over many other classification and regression methods [2–4] and is commonly used in genomic data analyses [5,6]. However, when the number of features is very large and the signals are relatively weak, its performance tends to decline (see Results and [7] for examples).

An intuitive idea to improve the performance of Random Forests is to evaluate the importance of each feature first and then keep only the most informative ones in a second round of analysis. This is the core idea of several feature elimination Random Forests algorithms [8–10]. As a by-product of Random Forests, the variable importance score (the increase in classification error rate or regression MSE when a feature is randomly permuted) [11] provides an assessment of the informativeness of each feature. The feature elimination methods use this measurement to iteratively select top-ranked features and re-train a Random Forests model based only on the selected features. While showing improvement in some cases [8–10], the main limitations of this feature elimination approach are that it is too rigid in feature selection, sensitive to inaccuracies in the importance estimates, and may lead to a significant increase in the correlation among trees, which can negatively affect performance (see Results).

To overcome this limitation, we propose a soft “feature selection” strategy in this paper. Unlike the feature elimination approach, which keeps only the features with the largest importance scores, we input all the features, as well as their importance scores, into a second-stage Random Forests model. However, in the random feature selection step when splitting a tree node, instead of sampling each feature with equal probability as Random Forests does, our new method samples features according to their importance scores. With this weighted sampling strategy, the final model is able to focus on the most informative features while not completely ignoring contributions from the others. We note that a similar method was proposed specifically for continuous-feature, two-class classification problems (called “enriched Random Forests”) [7]. The authors used marginal t-test (or conditional t-test [12]) q-values to guide the random feature selection. Here, we also extend this marginal testing idea to more general cases by adopting ANOVA for continuous-feature, multiple-class classification, the Chi-squared test for categorical-feature classification, and the F-test (linear regression with a single feature) for regression.

We evaluated the performance of the proposed variable importance-weighted Random Forests (viRF), the standard Random Forests, the feature elimination Random Forests and the marginal screening-based enriched Random Forests through comprehensive simulation studies and the analysis of gene expression data sets. We found that the viRF has better performance in most cases. These results suggest that viRF is effective in using high-dimensional genomic data to construct useful predictive models.

RESULTS

Regression

Simulation models

We consider the following three models in our simulations [13].

1. $y = 10\sin(10\pi x_1) + \epsilon$, with $x_i \sim \mathrm{Unif}(0,1)$, $i = 1, 2, \ldots, d$, and $\epsilon \sim N(0,1)$.

2. $y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.05)^2 + 10x_4 + 5x_5 + \epsilon$, with $x_i \sim \mathrm{Unif}(0,1)$, $i = 1, 2, \ldots, d$, and $\epsilon \sim N(0,1)$.

3. $y = f(x) + \epsilon$, with $x_i \sim \mathrm{Unif}(0,1)$, $i = 1, 2, \ldots, d$, $\epsilon \sim N(0,1)$, and $f(x)$ follows a tree structure as in Figure 1.

The total number of features, d, was varied from 5 to 200. For each d, we generated training sets with sample size, n, ranging from 10 to 500 (Figures 2–4). The accuracy of each method was evaluated on a testing data set with 500 samples. We repeated the process 50 times and report the average MSEs, where smaller values indicate better performance.
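To make this protocol concrete, below is a minimal sketch (not the authors' code) that generates data from simulation Model 1 and evaluates a standard Random Forests by test-set MSE with the randomForest R package; the particular (n, d) setting and the object names are illustrative.

```r
library(randomForest)

# Simulation Model 1: y = 10*sin(10*pi*x1) + eps, x_i ~ Unif(0,1), eps ~ N(0,1)
simulate_model1 <- function(n, d) {
  X <- matrix(runif(n * d), nrow = n, ncol = d)
  colnames(X) <- paste0("x", seq_len(d))
  y <- 10 * sin(10 * pi * X[, 1]) + rnorm(n)
  list(X = X, y = y)
}

set.seed(1)
train <- simulate_model1(n = 200, d = 100)   # one illustrative (n, d) setting
test  <- simulate_model1(n = 500, d = 100)   # testing set with 500 samples

rf  <- randomForest(x = train$X, y = train$y, ntree = 500)
mse <- mean((predict(rf, test$X) - test$y)^2)   # smaller MSE is better
```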

In simulation Model 1, only one ($x_1$) of the d features is informative. The variable importance-weighted Random Forests (viRF) and the feature elimination ones (feRF), being less influenced by noise features, both outperformed the standard Random Forests (RF) as expected (Figure 2). When the sample size (n) was large relative to the dimension (d), viRF performed better than feRF. We also note that although linear regression was not expected to be effective for the periodic (sine) function, the F-test q-value based enriched Random Forests (eRF) still outperformed the standard RF in most cases (since $x_1$ did get smaller p-values than $x_2, \ldots, x_d$). However, its performance was much worse than that of viRF, indicating an advantage of using Random Forests' variable importance score to quantify feature informativeness in this scenario.

Simulation Model 2 has five informative features ($x_1, \ldots, x_5$). For most d and n, all the modified Random Forests had similar performance, which was better than that of the standard RF (Figure 3).

For the tree structure in Model 3, with five informative features ($x_1, \ldots, x_5$), viRF performed the best in almost all cases (Figure 4).

We also considered three models with functional forms similar to Models 1–3, except that they contain both continuous and categorical features (Supplementary Figure S1). In addition, we considered cases where a certain level of correlation exists between the effective and nuisance features. The relative performance of all the methods was similar to what we observed in Models 1–3 (Supplementary Figures S2–S7).

Overall, these simulation results suggest an improved performance of viRF over standard RF and eRF (with linear model F-test weights) in regression. In addition, the feature weighting strategy also performed better than the feature elimination one in most scenarios except when the number of informative features and the sample size were both very small.

Drug sensitivity prediction

We further assessed the performance of these regression methods using the CCLE drug sensitivity data [14]. The CCLE data contain gene expression profiles of ~500 cancer cell lines and their sensitivities to 24 anticancer drugs. For each drug, we constructed a regression model with the sensitivity measurement (area under the dose-response curve) of a cell line as the response variable and the cell line's expression levels of 10,000 genes as the features. The MSEs estimated with 20 rounds of 5-fold cross-validation are shown in Table 1.
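A hedged sketch of this cross-validation protocol (5-fold, repeated for 20 rounds) is shown below for a standard Random Forests; `expr` and `sensitivity` are placeholder names for the processed CCLE expression matrix and drug-response vector, not objects distributed with the data.

```r
library(randomForest)

# expr: cell lines x 10,000 genes expression matrix (placeholder name)
# sensitivity: numeric drug-response vector (area under dose-response curve)
repeated_cv_mse <- function(expr, sensitivity, k = 5, rounds = 20) {
  n <- length(sensitivity)
  mse <- numeric(rounds)
  for (r in seq_len(rounds)) {
    folds <- sample(rep(seq_len(k), length.out = n))   # random fold assignment
    sq_err <- numeric(n)
    for (f in seq_len(k)) {
      test_idx <- which(folds == f)
      rf <- randomForest(x = expr[-test_idx, ], y = sensitivity[-test_idx])
      sq_err[test_idx] <- (predict(rf, expr[test_idx, ]) - sensitivity[test_idx])^2
    }
    mse[r] <- mean(sq_err)
  }
  mean(mse)   # average MSE over the rounds
}
```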

We observed that for all drugs other than “AZD0530” and “Paclitaxel”, the best-performing methods were always among the weighted Random Forests (viRF or eRF based on marginal F-test q-values), yet no clear winner emerged between these two approaches. This suggests that in practice it is a good strategy to weight features according to their informativeness, so that the final model is dominated by key features while not completely ignoring the less informative ones, keeping it robust to inaccuracies in the evaluation of feature importance and effective dimension.

Classification

Simulation models

We generated three simulated data sets analogous to those in the regression analyses to evaluate the performance of the methods in classification. In simulation Model 1, we consider a continuous-feature, two-class classification problem with only one informative feature ($x_1$); the coefficients of the logistic function were selected so that the two classes have balanced sample sizes. In simulation Model 2, the number of informative features increases to five ($x_1, \ldots, x_5$). In simulation Model 3, we consider a tree structure with five informative continuous features ($x_1, \ldots, x_5$) and four classes. A small data-generating sketch follows the model definitions below.

1. $y \sim \mathrm{Bernoulli}\left(\frac{1}{1+\exp(1-2x_1)}\right)$, with $x_i \sim \mathrm{Unif}(0,1)$, $i = 1, 2, \ldots, d$.

2. $y \sim \mathrm{Bernoulli}\left(\frac{1}{1+\exp\left(-\frac{10\sin(\pi x_1 x_2)+20(x_3-0.05)^2+10x_4+5x_5-20}{3}\right)}\right)$, with $x_i \sim \mathrm{Unif}(0,1)$, $i = 1, 2, \ldots, d$.

3. $y$ follows a tree structure as in Figure 5, with $x_i \sim \mathrm{Unif}(0,1)$, $i = 1, 2, \ldots, d$.
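As referenced above, the following minimal sketch (illustrative names, not the authors' code) generates data from classification Model 1 and evaluates a standard Random Forests by test error rate.

```r
library(randomForest)

# Classification Model 1: P(y = 1 | x) = 1 / (1 + exp(1 - 2*x1)), x_i ~ Unif(0,1)
simulate_class_model1 <- function(n, d) {
  X <- matrix(runif(n * d), nrow = n, ncol = d)
  colnames(X) <- paste0("x", seq_len(d))
  p <- 1 / (1 + exp(1 - 2 * X[, 1]))
  y <- factor(rbinom(n, size = 1, prob = p))   # factor response -> classification
  list(X = X, y = y)
}

set.seed(1)
train <- simulate_class_model1(n = 200, d = 100)
test  <- simulate_class_model1(n = 500, d = 100)

rf  <- randomForest(x = train$X, y = train$y, ntree = 500)
err <- mean(predict(rf, test$X) != test$y)   # classification error rate
```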

We plot the classification error rates of all the methods over a range of training sample sizes in Figures 6–8. Unlike what was observed in the regression cases, the feature elimination Random Forests (feRF) performed almost consistently worse than the other methods. The variable importance-weighted Random Forests (viRF) and the enriched Random Forests (eRF, marginal q-value weighted) both achieved the lowest prediction error rates in most cases, and viRF was the single best classifier in the rest. It is worth noting that for Model 3, which has a tree structure, the standard Random Forests (RF), as an ensemble of trees, performed very well when the dimension was low (occasionally even slightly better than the weighted Random Forests); however, as the number of uninformative features increased, its performance declined and the benefit of weighting the features became clear (Figure 8).

Additionally, we considered three cases where both continuous and categorical features exist (Supplementary Figure S8) and three cases where some nuisance features are correlated with the effective features. The observations were similar to those on Models 1–3 (Supplementary Figures S9–S14).

To gain more insight into these different performances, we investigated the strength of individual trees and the correlation between trees [2] for each method (Supplementary Figures S15–S23). Generally, Random Forests achieves accurate classification by keeping the correlation between trees small while maintaining the strength of individual trees [2]. We note that, compared to the standard RF, the weighted versions (viRF and eRF) have both increased strength and increased correlation; since the gain in strength outweighs the deterioration in correlation, the overall performance is boosted. For feRF, however, the same increase in single-tree strength often comes at a much larger cost in correlation, resulting in worse overall performance.

Cancer (subtype) classification

We then assessed the performance of these methods for cancer/normal and cancer subtype classification. We considered three data sets (Table 2) with various sample sizes and input types.

We performed 5-fold cross-validation for 20 rounds and report the average prediction errors in Table 3. In all three data sets, viRF performed the best. eRF (marginal test q-value weighted) also achieved better performance than the standard RF. However, similar to what we observed in the simulations, feRF performed poorly, even worse than the standard RF.

DISCUSSION

We have proposed a variable importance-weighted Random Forests (viRF) that uses the variable importance scores obtained from a standard Random Forests to sample features in a weighted Random Forests. This enables the final model to rely more on informative features, and hence to cope better than the standard Random Forests with the growing noise as the dimension increases.

Unlike the feature elimination Random Forests, which removes features with small importance scores entirely, our strategy allows features carrying weaker information to be considered in the final model; it is thus more flexible and less prone to inaccuracies that might occur in the feature evaluation and selection steps. Especially when interactions exist between features (simulation Model 3 and the real biological data), the weighted feature sampling method is better able to capture contributions from features that are less important marginally, and has superior performance. In addition, by avoiding the internal cross-validation that feature elimination Random Forests requires to determine the optimal number of features, the weighted Random Forests greatly reduces the computational burden.

In addition, we extended a previous idea (enriched Random Forests) for continuous-feature, two-class classification problems to more general cases by using the corresponding marginal test q-values to guide the random feature selection. Although such marginal tests may not, in theory, provide an ideal evaluation of the features' informativeness, they sometimes produced promising results in our real data analyses as well.

In summary, sampling features according to their informativeness in the random feature selection step can further enhance Random Forests' performance, in both regression and classification. Since the variable importance score estimated by Random Forests, defined as the increase in MSE/error rate when a feature is randomly permuted, provides a reliable measurement of the feature's relevance to the outcome, we consider it a useful weight to assign to features. In our R implementation of the weighted Random Forests, the default weight is the variable importance score; we also give users the option to specify other weights.

MATERIALS AND METHODS

Random Forests and variable importance score

Random Forests is an ensemble of classification and regression trees (CART) [2]. Each tree is grown on a bootstrapped sample from the original data set. At each node, m out of the p total features are randomly selected (random feature selection) and the best split is chosen from them. The constructed trees vote for the most popular class (classification), or their predictions are averaged (regression). For the Random Forests classifier, an upper bound on the generalization error can be derived in terms of the strength of individual trees and the correlation between them [2]:

$$PE^* \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2},$$

where $PE^*$ is the generalization error, $\bar{\rho}$ is the average pairwise correlation between trees, and $s$ is the average single-tree strength, defined as follows:

$$PE^* = P_{X,Y}\Big(P_{\Theta}\big(h(X,\Theta) = Y\big) - \max_{j \neq Y} P_{\Theta}\big(h(X,\Theta) = j\big) < 0\Big),$$

$$s = E_{X,Y}\Big[P_{\Theta}\big(h(X,\Theta) = Y\big) - \max_{j \neq Y} P_{\Theta}\big(h(X,\Theta) = j\big)\Big],$$

$$\bar{\rho} = E_{\Theta,\Theta'}\Big[\mathrm{cor}\Big(I\big(h(X,\Theta) = Y\big) - I\big(h(X,\Theta) = \hat{j}(X,Y)\big),\; I\big(h(X,\Theta') = Y\big) - I\big(h(X,\Theta') = \hat{j}(X,Y)\big)\Big)\Big],$$

where $X$ and $Y$ denote the features and the true class, $\Theta$ and $h(X,\Theta)$ represent a tree and its predicted class for $X$, and $\hat{j}(X,Y) = \arg\max_{j \neq Y} P_{\Theta}\big(h(X,\Theta) = j\big)$.
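As a rough empirical sketch (not the procedure used in the paper), these quantities can be approximated from the per-tree votes exposed by the randomForest package; `rf`, `X_test` and `y_test` below are assumed, illustrative objects (a fitted classification forest, test features and test labels).

```r
# Approximate single-tree strength and average pairwise tree correlation
# for a fitted classification randomForest on a held-out test set.
estimate_strength_correlation <- function(rf, X_test, y_test) {
  votes   <- predict(rf, X_test, predict.all = TRUE)$individual  # n x ntree class labels
  classes <- levels(y_test)
  y_chr   <- as.character(y_test)

  # P_Theta(h(X) = j) estimated by the fraction of trees voting class j
  vote_frac <- t(apply(votes, 1, function(v) table(factor(v, levels = classes)) / length(v)))
  p_true <- vote_frac[cbind(seq_along(y_chr), match(y_chr, classes))]

  # most-voted wrong class j_hat and its vote fraction
  wrong <- vote_frac
  wrong[cbind(seq_along(y_chr), match(y_chr, classes))] <- -Inf
  j_hat  <- classes[max.col(wrong)]
  p_jhat <- apply(wrong, 1, max)

  s <- mean(p_true - p_jhat)                   # empirical strength

  # raw margin of each tree: I(h = Y) - I(h = j_hat), one column per tree
  rmg <- (votes == y_chr) - (votes == j_hat)
  R   <- suppressWarnings(cor(rmg))            # pairwise tree correlations
  rho_bar <- mean(R[upper.tri(R)], na.rm = TRUE)

  c(strength = s, correlation = rho_bar)
}
```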

Random Forests can also generate a variable importance score for each feature, evaluated as the increase in classification error rate or regression MSE (measured on out-of-bag samples, i.e., samples not used to grow a tree) when the feature is randomly permuted [11]. Intuitively, permuting an informative feature will lead to a large increase in classification error rate or regression MSE, while permuting a non-informative feature will not influence the model's performance much. A slightly modified version of the variable importance score takes the uncertainty of the mean increase in error rate/MSE across trees into account by normalizing the value by its standard deviation. We considered both the unnormalized and normalized variable importance scores in our analyses, and the results did not suggest a major performance difference.
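With the randomForest R package, both versions of the permutation-based importance score can be obtained directly; a brief sketch (X_train and y_train are illustrative objects):

```r
library(randomForest)

# Fit a forest with permutation importance enabled
# (works for both classification and regression).
rf <- randomForest(x = X_train, y = y_train, ntree = 500, importance = TRUE)

# type = 1: mean decrease in accuracy (classification) or increase in MSE
# (regression) when each feature is permuted on out-of-bag samples.
vi_unnormalized <- importance(rf, type = 1, scale = FALSE)
vi_normalized   <- importance(rf, type = 1, scale = TRUE)   # divided by its standard error
```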

Variable importance-weighted Random Forests

We propose a two-stage variable importance-weighted Random Forests (viRF) method. In the first stage, we run a standard Random Forests and obtain a variable importance score $w_i$, $i = 1, 2, \ldots, d$, for each feature. Considering that some $w_i$ may be zero or negative, we transform these importance scores as follows:

$$\tilde{w}_i = \begin{cases} \dfrac{1}{d} + \dfrac{w_i}{\max_j w_j}, & \text{if } \max_j w_j > 0, \\[4pt] \dfrac{1}{d}, & \text{otherwise.} \end{cases}$$

In the second stage, we construct another Random Forests. However, instead of sampling the m candidate features with equal probability in the random feature selection step, we sample them with probability proportional to $\tilde{w}_i$, $i = 1, 2, \ldots, d$. The best split is then chosen from these m features.
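The weight transformation and the two-stage flow can be sketched as follows. Note that the node-level weighted feature sampling itself is implemented inside the viRandomForests package; the last lines only illustrate one such weighted draw in base R, and the object names and the choice of m are illustrative.

```r
library(randomForest)

# Stage 1: standard Random Forests to obtain variable importance scores w_i
rf1 <- randomForest(x = X_train, y = y_train, importance = TRUE)
w   <- importance(rf1, type = 1, scale = FALSE)[, 1]

# Transform the scores following the equation above, so that every feature
# keeps a baseline sampling weight of 1/d
transform_weights <- function(w) {
  d <- length(w)
  if (max(w) > 0) 1 / d + w / max(w) else rep(1 / d, d)
}
w_tilde <- transform_weights(w)

# Stage 2: at each node, the m candidate features are drawn with probability
# proportional to w_tilde instead of uniformly (done inside viRandomForests).
# One such weighted draw, for illustration (m shown with the classification
# default sqrt(d); assumes the transformed weights are non-negative):
m <- floor(sqrt(length(w_tilde)))
candidate_features <- sample(seq_along(w_tilde), size = m, prob = w_tilde)
```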

Other methods

Besides Random Forests' variable importance score, we also consider a marginal testing approach to assessing the importance of a feature, and perform the weighted feature sampling using the q-values of the t-test (continuous-feature, two-class classification), ANOVA (continuous-feature, multiple-class classification), the Chi-squared test (categorical-feature classification) or the linear regression F-test (regression), following the idea of [7].
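As a hedged sketch of the regression case, the per-feature linear-model F-test p-values can be computed as below; Benjamini–Hochberg adjusted p-values are used here as a stand-in for q-values, and the final conversion of q-values into sampling weights (1 − q) is only illustrative, as [7] describes its own weighting scheme.

```r
# Marginal F-test p-value for each feature: regress y on that feature alone.
marginal_f_pvalues <- function(X, y) {
  apply(X, 2, function(x) {
    fstat <- summary(lm(y ~ x))$fstatistic            # (value, numdf, dendf)
    pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
  })
}

p <- marginal_f_pvalues(X_train, y_train)   # X_train, y_train: illustrative names
q <- p.adjust(p, method = "BH")             # BH-adjusted p-values as a q-value proxy
feature_weights <- 1 - q                    # larger weight for more significant features
```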

We compare the weighted Random Forests with the feature elimination Random Forests [8–10]. At each iteration, r% (we used r = 30 here) of the features with the smallest importance scores are removed and a Random Forests is constructed from the remaining ones. The number of features used in the final model is determined by cross-validation. We consider both recursive (variable importance scores are updated at each iteration) and non-recursive (variable importance scores are calculated only once, at the first iteration) approaches for feature elimination, and the results did not suggest an obvious difference.
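A compact sketch of the non-recursive elimination loop described above (importance computed once, then the bottom r% dropped at each iteration) is given below; the cross-validation step that picks the final feature count is replaced by out-of-bag errors for brevity, and object names are illustrative.

```r
library(randomForest)

# Non-recursive feature elimination: rank features once, then repeatedly drop
# the bottom r% and refit, recording the out-of-bag error for each feature count.
feature_elimination_rf <- function(X, y, r = 0.30, min_features = 2) {
  vi <- importance(randomForest(x = X, y = y, importance = TRUE),
                   type = 1, scale = FALSE)[, 1]
  keep <- names(sort(vi, decreasing = TRUE))
  results <- data.frame()
  while (length(keep) >= min_features) {
    rf  <- randomForest(x = X[, keep, drop = FALSE], y = y)
    oob <- if (is.factor(y)) rf$err.rate[rf$ntree, "OOB"] else rf$mse[rf$ntree]
    results <- rbind(results, data.frame(n_features = length(keep), oob_error = oob))
    keep <- keep[seq_len(floor(length(keep) * (1 - r)))]   # drop the bottom r%
  }
  results   # the feature count minimizing error would then be chosen by cross-validation
}
```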

Data sets

We downloaded the CCLE data set from https://portals.broadinstitute.org/ccle/home. To reduce the computational cost, we kept the 10,000 genes with the largest coefficients of variation in expression in our analyses. We downloaded the classification data sets from http://archive.ics.uci.edu/ml/datasets/Arcene and http://stat.ethz.ch/%7Edettling/bagboost.html (the latter hosting two of the data sets). All features included in the original data sets were used.

References

[1]

Hanahan, D. and Weinberg, R. A. (2011) Hallmarks of cancer: the next generation. Cell, 144, 646–674

[2]

Breiman, L. (2001) Random forests. Mach. Learn., 45, 5–32

[3]

Palmer, D. S., O’Boyle, N. M., Glen, R. C. and Mitchell, J. B. (2007) Random forest models to predict aqueous solubility. J. Chem. Inf. Model., 47, 150–158

[4]

Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X. and Lu, Z. (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res., 35, W339–W344

[5]

Lee, J. W., Lee, J. B., Park, M. and Song, S. H. (2005) An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Anal., 48, 869–885

[6]

Goldstein, B. A., Polley, E. C. and Briggs, F. B. (2011) Random forests for genetic association studies. Stat. Appl. Genet. Mol. Biol., 10, 32

[7]

Amaratunga, D., Cabrera, J. and Lee, Y. S. (2008) Enriched random forests. Bioinformatics, 24, 2010–2014

[8]

Granitto, P. M., Furlanello, C., Biasioli, F. and Gasperi, F. (2006) Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometr. Intell. Lab., 83, 83–90

[9]

Svetnik, V., Liaw, A., Tong, C. and Wang, T. (2004) Application of Breiman’s random forest to modeling structure-activity relationships of pharmaceutical molecules. Lect. Notes Comput. Sci., 3077, 334–343

[10]

Díaz-Uriarte, R. and de Andrés, S.A. (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3

[11]

Breiman, L. (2001) Statistical modeling: the two cultures. Stat. Sci., 16, 199–231

[12]

Amaratunga, D. and Cabrera, J. (2009) A conditional t suite of tests for identifying differentially expressed genes in a DNA microarray experiment with little replication. Stat. Biopharm. Res., 1, 26–38

[13]

Biau, G. (2012) Analysis of a random forests model. J. Mach. Learn. Res., 13, 1063–1095

[14]

Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V., Sonkin, D., et al. (2012) The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483, 603–607

[15]

Guyon, I., Gunn, S., Ben-Hur, A. and Dror, G. (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Proceedings of the 17th International Conference on Neural Information Processing Systems (NIPS'04), pp. 545–552

[16]

Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436–442

[17]

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D'Amico, A. V., Richie, J. P., et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203–209

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag Berlin Heidelberg


Supplementary files

QB-07121-OF-ZHY_suppl_1
