Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation

Yawei Li, Yuan Luo

Quant. Biol. 2020, Vol. 8, Issue 4: 347-358. DOI: 10.1007/s40484-020-0226-1
RESEARCH ARTICLE

Abstract

Background: Improvements in next-generation DNA sequencing technology have lowered the cost of collecting genetic data, so more machine learning techniques can be applied to cancer analysis and diagnosis.

Methods: We developed an ensemble machine learning system, named the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. The ensemble consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost and neural networks). We first used cross-validation to obtain the predicted results of the five classifiers. The weights of the five weak classifiers were then obtained from their predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the classifiers, of each classifier’s weight multiplied by its predicted probability.
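The combination step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact regression formulation, so the sketch assumes one scalar weight per classifier, fitted by ordinary least squares so that the weighted sum of cross-validated probabilities approximates the one-hot labels. The function names `fit_voting_weights` and `weighted_vote` and the toy data are hypothetical.

```python
import numpy as np

def fit_voting_weights(cv_probs, labels, n_classes):
    """Fit one weight per classifier by least squares (assumed formulation).

    cv_probs: shape (n_classifiers, n_samples, n_classes), out-of-fold
              predicted probabilities from cross-validation.
    labels:   integer class labels, shape (n_samples,).
    """
    cv_probs = np.asarray(cv_probs, dtype=float)
    onehot = np.eye(n_classes)[np.asarray(labels)]
    # One row per (sample, class) pair, one column per classifier.
    X = cv_probs.reshape(cv_probs.shape[0], -1).T
    y = onehot.ravel()
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def weighted_vote(probs, weights):
    """Return sum_k weights[k] * probs[k], shape (n_samples, n_classes)."""
    return np.tensordot(weights, np.asarray(probs, dtype=float), axes=1)

# Toy data: classifier 0 is always right, classifier 1 is always wrong.
labels = [0, 1, 2, 1]
cv_probs = np.array([
    np.eye(3)[labels],
    [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]],
])
weights = fit_voting_weights(cv_probs, labels, n_classes=3)
scores = weighted_vote(cv_probs, weights)
predicted = scores.argmax(axis=1)
```

In this toy case the fit puts essentially all of the weight on the accurate classifier, so the ensemble reproduces its predictions. The sketch leaves the weights unconstrained; whether the published model normalizes them or restricts them to be non-negative is not stated in the abstract.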

Results: Using the somatic mutation count of each gene as the input feature, the overall accuracy of the performance-weighted-voting model reached 71.46%, which was significantly higher than that of each of the five weak classifiers and of two other ensemble models: the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types, a higher tumor mutational burden can improve overall accuracy.
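For readers unfamiliar with the two baseline ensembles named above, the following sketch shows the standard definitions of hard voting (majority of the classifiers' predicted labels) and soft voting (unweighted average of their predicted probabilities). The function names and toy numbers are hypothetical and are not taken from the study's code.

```python
import numpy as np

def hard_vote(probs):
    """Majority vote over each classifier's top class (ties go to the lowest index)."""
    probs = np.asarray(probs, dtype=float)
    votes = probs.argmax(axis=2)              # (n_classifiers, n_samples)
    n_samples, n_classes = probs.shape[1], probs.shape[2]
    counts = np.zeros((n_samples, n_classes), dtype=int)
    for clf_votes in votes:
        counts[np.arange(n_samples), clf_votes] += 1
    return counts.argmax(axis=1)

def soft_vote(probs):
    """Average the predicted probabilities with equal weight, then take the top class."""
    return np.asarray(probs, dtype=float).mean(axis=0).argmax(axis=1)

# One sample, two classes, three classifiers: one confident vote for class 0,
# two lukewarm votes for class 1.
probs = np.array([
    [[0.90, 0.10]],
    [[0.40, 0.60]],
    [[0.45, 0.55]],
])
hard_prediction = hard_vote(probs)   # majority of labels
soft_prediction = soft_vote(probs)   # average of probabilities
```

On this toy input the two rules disagree: hard voting follows the two classifiers that prefer class 1, while soft voting follows the averaged probabilities, which favor class 0. A performance-weighted scheme differs from both in that the classifiers do not contribute equally.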

Conclusion: This study has important clinical significance for identifying the origin of a cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for using ensemble systems for cancer type classification.

Keywords

cancer type classification / ensemble method / performance-weighted-voting model / linear regression / single-nucleotide polymorphism

Cite this article

Yawei Li, Yuan Luo. Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation. Quant. Biol., 2020, 8(4): 347‒358 https://doi.org/10.1007/s40484-020-0226-1

DATA AVAILABILITY

The TCGA MC3 Public MAF file and the txt file are available at https://gdc.cancer.gov/about-data/publications/pancan-driver. The code for the models used in this study is available online at GitHub (https://github.com/duckliyawei/performance-weighted-voting).

SUPPLEMENTARY MATERIALS

The supplementary materials can be found online with this article at https://doi.org/10.1007/s40484-020-0226-1.

ACKNOWLEDGEMENTS

We thank Chengsheng Mao for comments and suggestions during the preparation of the manuscript. We thank Xin Wu for help with the artwork for the figures. This study was supported in part by NIH grant R21LM012618.

COMPLIANCE WITH ETHICS GUIDELINES

The authors Yawei Li and Yuan Luo declare that they have no conflicts of interest.
This article does not contain any studies with human or animal subjects performed by any of the authors.

RIGHTS & PERMISSIONS

© 2020 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature