A feature extraction framework for discovering pan-cancer driver genes based on multi-omics data

Xiaomeng Xue, Feng Li, Junliang Shang, Lingyun Dai, Daohui Ge, Qianqian Ren

Quant. Biol. 2024, Vol. 12, Issue 2: 173-181. DOI: 10.1002/qub2.40
RESEARCH ARTICLE


Abstract

The identification of tumor driver genes facilitates accurate cancer diagnosis and treatment and plays a key role in precision oncology, along with gene signaling, regulation, and their interaction with protein complexes. To tackle the challenge of distinguishing driver genes within large volumes of genomic data, we construct a feature extraction framework for discovering pan-cancer driver genes based on multi-omics data (mutations, gene expression, copy number variants, and DNA methylation) combined with protein–protein interaction (PPI) networks. Using a network propagation algorithm, we mine functional information among nodes in the PPI network, focusing on genes with weak node information, to obtain functional features that represent cancer-specific information. From these functional features, we extract three kinds of pan-cancer features: distribution features; TOPSIS features, computed with the ideal solution method; and SetExpan features, which rank the pan-cancer data by the average inverse rank (mean reciprocal rank). These features represent the information shared across cancer types. Finally, we use the LightGBM classification algorithm for gene prediction. Experimental results show that our method outperforms existing methods in terms of the area under the precision-recall curve (AUPRC) and performs better across different PPI networks. This indicates our framework's effectiveness in predicting potential cancer genes, offering valuable insights for the diagnosis and treatment of tumors.
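The feature steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the random-walk-with-restart form of network propagation, the restart weight `alpha`, the equal-weight TOPSIS with all columns treated as benefit criteria, and the function names are all assumptions made for illustration. The distribution features and the LightGBM classifier are omitted.

```python
import numpy as np

def propagate(adj, scores, alpha=0.5, tol=1e-8, max_iter=1000):
    """Random walk with restart (assumed form of network propagation).

    adj: symmetric (n, n) adjacency matrix of the PPI network.
    scores: (n,) or (n, c) initial gene scores, e.g. one column per cancer type.
    """
    adj = np.asarray(adj, dtype=float)
    scores = np.asarray(scores, dtype=float)
    deg = adj.sum(axis=0)
    deg[deg == 0] = 1.0              # avoid division by zero for isolated nodes
    w = adj / deg                    # column-normalized transition matrix
    f = scores.copy()
    for _ in range(max_iter):
        nxt = alpha * (w @ f) + (1 - alpha) * scores
        if np.abs(nxt - f).max() < tol:
            return nxt
        f = nxt
    return f

def topsis(x):
    """Relative closeness of each row to the ideal solution.

    Assumes equal criterion weights and that larger values are better.
    """
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=0)
    norm[norm == 0] = 1.0
    v = x / norm                                 # vector-normalized decision matrix
    best, worst = v.max(axis=0), v.min(axis=0)   # ideal and anti-ideal points
    d_best = np.linalg.norm(v - best, axis=1)
    d_worst = np.linalg.norm(v - worst, axis=1)
    denom = d_best + d_worst
    denom[denom == 0] = 1.0
    return d_worst / denom

def mean_reciprocal_rank(x):
    """Average over columns of 1/rank, where rank 1 is the largest value."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x, axis=0)               # row indices sorted by descending value
    ranks = np.empty_like(order)
    positions = np.arange(1, x.shape[0] + 1)
    for j in range(x.shape[1]):
        ranks[order[:, j], j] = positions
    return (1.0 / ranks).mean(axis=1)

# Toy example: a 4-gene path network and scores for 2 hypothetical cancer types.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
initial = np.array([[1.0, 0.0],
                    [0.0, 0.0],
                    [0.0, 0.0],
                    [0.0, 1.0]])
functional = propagate(adj, initial)
features = np.column_stack([topsis(functional), mean_reciprocal_rank(functional)])
```

In this toy run, `features` holds one row per gene with a TOPSIS closeness and a mean-reciprocal-rank score; in a setting like the one the abstract describes, such columns would be passed to a classifier together with the other features.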

Keywords

cancer driver genes / feature extraction / multi-omics data / network propagation / pan-cancer

Cite this article

Xiaomeng Xue, Feng Li, Junliang Shang, Lingyun Dai, Daohui Ge, Qianqian Ren. A feature extraction framework for discovering pan-cancer driver genes based on multi-omics data. Quant. Biol., 2024, 12(2): 173‒181 https://doi.org/10.1002/qub2.40


RIGHTS & PERMISSIONS

© 2024 The Authors. Quantitative Biology published by John Wiley & Sons Australia, Ltd on behalf of Higher Education Press.