Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation
Yawei Li, Yuan Luo
Background: Improvements in next-generation DNA sequencing technology have lowered the cost of collecting genetic data, so more machine learning techniques can now be applied to cancer analysis and diagnosis.
Methods: We developed an ensemble machine learning system, the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. The ensemble consists of five weak classifiers: logistic regression, support vector machine (SVM), random forest, XGBoost, and a neural network. We first used cross-validation to obtain the predictions of the five classifiers. We then derived a weight for each classifier from its predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the five classifiers, of each classifier’s weight multiplied by its predicted probability for that cancer type.
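The weighting and voting steps described in Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the weights come from an unconstrained ordinary least-squares fit of one-hot class labels on the classifiers' cross-validated probabilities, which is only one reading of "solving linear regression functions". The function names (`solve_linear_system`, `fit_weights`, `weighted_vote`) and the toy numbers are hypothetical, and an unconstrained fit can return negative weights.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear_system(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_weights(cv_probs: List[Matrix], labels: List[int]) -> List[float]:
    """Least-squares weights, one per classifier (assumed formulation).

    cv_probs[k][i][c] is classifier k's cross-validated probability that
    sample i belongs to class c. Every (sample, class) pair is one regression
    row: the features are the classifiers' probabilities for that class and
    the target is 1.0 if c is the true class of sample i, else 0.0.
    """
    n_clf = len(cv_probs)
    n_classes = len(cv_probs[0][0])
    xtx = [[0.0] * n_clf for _ in range(n_clf)]
    xty = [0.0] * n_clf
    for i, label in enumerate(labels):
        for c in range(n_classes):
            row = [cv_probs[k][i][c] for k in range(n_clf)]
            target = 1.0 if c == label else 0.0
            for p in range(n_clf):
                xty[p] += row[p] * target
                for q in range(n_clf):
                    xtx[p][q] += row[p] * row[q]
    # Normal equations: (X^T X) w = X^T y
    return solve_linear_system(xtx, xty)


def weighted_vote(probs: Matrix, weights: List[float]) -> int:
    """Return the class with the largest weighted sum of probabilities.

    probs[k][c] is classifier k's probability for class c on one sample.
    """
    n_classes = len(probs[0])
    scores = [sum(w * p[c] for w, p in zip(weights, probs))
              for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: scores[c])


# Toy data (hypothetical): 2 classifiers, 4 samples, 3 classes.
toy_labels = [0, 1, 2, 1]
toy_cv_probs = [
    [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.3, 0.5, 0.2]],
    [[0.4, 0.4, 0.2], [0.5, 0.3, 0.2], [0.3, 0.3, 0.4], [0.2, 0.4, 0.4]],
]
toy_weights = fit_weights(toy_cv_probs, toy_labels)
toy_prediction = weighted_vote([[0.2, 0.5, 0.3], [0.5, 0.3, 0.2]], toy_weights)
print(toy_weights, toy_prediction)
```

In this sketch the classifier whose cross-validated probabilities track the true labels more closely receives the larger weight, so its opinion dominates the final vote. Whether the original model constrains or normalizes the weights is not stated in the abstract.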
Results: Using the somatic mutation count of each gene as the input feature, the performance-weighted-voting model reached an overall accuracy of 71.46%, significantly higher than that of each of the five weak classifiers and of two other ensemble models, the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types a higher tumor mutational burden can improve overall accuracy.
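For readers unfamiliar with the two baseline ensembles, the sketch below shows the standard definitions of hard voting (majority over each classifier's top class) and soft voting (argmax of the unweighted mean probability). It is an illustration only: the function names `hard_vote` and `soft_vote`, the tie-breaking rule, and the toy probabilities are assumptions, and the abstract does not describe how the authors configured these baselines.

```python
from collections import Counter
from typing import List


def hard_vote(probs: List[List[float]]) -> int:
    """Majority vote over each classifier's top class.

    probs[k][c] is classifier k's probability for class c on one sample.
    Ties are broken in favor of the lowest class index (an assumption).
    """
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probs]
    counts = Counter(votes)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)


def soft_vote(probs: List[List[float]]) -> int:
    """Argmax of the unweighted mean probability across classifiers."""
    n_classes = len(probs[0])
    means = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: means[c])


# Toy sample (hypothetical): two classifiers lean weakly toward class 0,
# one is strongly confident in class 1.
sample_probs = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
print(hard_vote(sample_probs), soft_vote(sample_probs))
```

The two rules can disagree on the same sample, because hard voting ignores how confident each classifier is while soft voting weights all classifiers equally regardless of their measured performance.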
Conclusion: This study has important clinical significance for identifying the origin of a cancer, especially when the primary site cannot be determined. In addition, our model presents a good strategy for using ensemble systems in cancer type classification.
Keywords: cancer type classification / ensemble method / performance-weighted-voting model / linear regression / single-nucleotide polymorphism