Abstract
Background: Improvements in next-generation DNA sequencing technology have lowered the cost of collecting genetic data, so more machine learning techniques can now be applied to cancer analysis and diagnosis.
Methods: We developed an ensemble machine learning system, the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. The ensemble consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost and neural networks). We first used cross-validation to obtain predictions from the five classifiers. The weight of each classifier was then derived from its predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the five classifiers, of each classifier’s weight multiplied by its predicted probability.
Results: Using the somatic mutation count of each gene as the input feature, the performance-weighted-voting model reached an overall accuracy of 71.46%, significantly higher than that of each of the five weak classifiers and of two other ensemble models, the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types a higher tumor mutational burden can improve overall accuracy.
Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for applying ensemble systems to cancer type classification.
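The weighting step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that, for each cancer type, the weights come from an ordinary least-squares fit (no intercept) of the true-label indicator on the classifiers' out-of-fold probabilities for that type. The function names, the tiny `ridge` stabilizer, and the two-classifier toy data are all hypothetical; the study itself uses five classifiers and 14 cancer types.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_weights(oof_probs, labels, cancer_type, ridge=1e-8):
    """Least-squares classifier weights for one cancer type (assumed setup).

    oof_probs[k][i][t] is the out-of-fold probability that classifier k
    assigns to cancer type t for sample i; labels[i] is the true type.
    """
    n_clf, n = len(oof_probs), len(labels)
    y = [1.0 if lab == cancer_type else 0.0 for lab in labels]
    col = [[oof_probs[k][i][cancer_type] for i in range(n)] for k in range(n_clf)]
    # Normal equations (X^T X + ridge * I) w = X^T y; ridge only guards
    # against a singular matrix and is an assumption of this sketch.
    xtx = [[sum(p * q for p, q in zip(col[j], col[k])) + (ridge if j == k else 0.0)
            for k in range(n_clf)] for j in range(n_clf)]
    xty = [sum(p * t for p, t in zip(col[j], y)) for j in range(n_clf)]
    return solve_linear_system(xtx, xty)


def weighted_vote(weights, sample_probs):
    """Score each type as sum_k weights[t][k] * sample_probs[k][t]."""
    scores = [sum(w * sample_probs[k][t] for k, w in enumerate(weights[t]))
              for t in range(len(weights))]
    return max(range(len(scores)), key=scores.__getitem__), scores


# Toy data: classifier 0 is informative, classifier 1 always outputs 0.5.
oof_probs = [
    [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
]
labels = [0, 0, 1, 1]
weights = [fit_weights(oof_probs, labels, t) for t in range(2)]

new_sample = [[0.7, 0.3], [0.5, 0.5]]  # one probability row per classifier
predicted, scores = weighted_vote(weights, new_sample)
print(weights, predicted, scores)
```

In this toy case the informative classifier receives the larger weight, which is the behavior the weighting is meant to capture. By contrast, soft voting would weight every classifier equally, and hard voting would count only each classifier's top prediction.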
Keywords
cancer type classification / ensemble method / performance-weighted-voting model / linear regression / single-nucleotide polymorphism
Cite this article
Yawei Li, Yuan Luo. Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation. Quant. Biol., 2020, 8(4): 347-358. DOI: 10.1007/s40484-020-0226-1
RIGHTS & PERMISSIONS
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature