Variable importance-weighted Random Forests
Yiyi Liu, Hongyu Zhao
Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates as the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which retains only the features with the largest variable importance scores. Yet the performance of this method is still not satisfactory, possibly due to its rigid feature selection and the increased correlation between trees in the forest.
Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node when building trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features.
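As an illustration of this sampling scheme, the following R sketch (not the authors' implementation; the function name sample_features and the inputs imp and mtry are hypothetical) draws mtry candidate features at a node with probabilities proportional to their variable importance scores:

sample_features <- function(imp, mtry) {
  probs <- imp / sum(imp)                # normalize the non-negative importance scores
  sample(seq_along(imp), size = mtry,    # draw mtry features without replacement,
         replace = FALSE, prob = probs)  # with probability proportional to importance
}

# Example: among 10 features, feature 3 carries most of the importance
# and is therefore sampled most often.
set.seed(1)
imp <- c(0.5, 0.2, 5, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.1)
sample_features(imp, mtry = 3)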
Results: We evaluate the performance of our method through comprehensive simulations and real data analyses, for both regression and classification. Compared with standard Random Forests and feature elimination Random Forests, our proposed method performs better in most cases.
Conclusions: By incorporating variable importance scores into the random feature selection step, our method makes better use of more informative features without completely ignoring less informative ones, and hence improves prediction accuracy in the presence of weak signals and large noise. We have implemented an R package, "viRandomForests," based on the original R package "randomForest"; it can be freely downloaded from http://zhaocenter.org/software.
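For context, variable importance scores of the kind used here as sampling weights can be obtained from a standard Random Forests fit. Below is a minimal R example using the randomForest package (the iris data and this two-step workflow are purely illustrative; this is not the viRandomForests interface):

library(randomForest)

# Fit a standard forest on the built-in iris data and extract
# Gini-based variable importance scores, which could then serve
# as the feature sampling weights described above.
data(iris)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(fit, type = 2)  # type = 2: mean decrease in Gini impurity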
Keywords: Random Forests / variable importance score / classification / regression