Characterizing diseases using genetic and clinical variables: A data analytics approach

Madhuri Gollapalli; Harsh Anand; Satish Mahadevan Srinivasan

doi:10.1002/qub2.46

Quant. Biol. 2024, Vol. 12, Issue 3: 271-285. DOI: 10.1002/qub2.46

RESEARCH ARTICLE

Abstract

Predictive analytics is crucial in precision medicine for personalized patient care. To support precision medicine, this study identifies a subset of genetic and clinical variables that can serve as predictors for classifying diseased tissue types. Experiments were performed on diseased tissues obtained from the L1000 dataset to assess differences in the functionality and predictive capabilities of genetic and clinical variables. The k‐means technique was used to cluster the diseased tissue types, and multinomial logistic regression (MLR) was used to classify them. Dimensionality reduction techniques, including principal component analysis and Boruta, were used extensively to reduce the dimensionality of the genetic and clinical variables. The results showed that landmark genes performed slightly better in clustering diseased tissue types than any random set of 978 non‐landmark genes, and that the difference was statistically significant. Both clinical and genetic variables were important in predicting the diseased tissue types. The top three clinical predictors were morphology, gender, and age at diagnosis. This study also explored using the latent representations of the clusters of landmark and non‐landmark genes as predictors for an MLR classifier. The classification models built using MLR revealed that landmark genes can serve as a subset of genetic variables and/or as a proxy for clinical variables. The study concludes that combining predictive analytics with dimensionality reduction effectively identifies key predictors in precision medicine, enhancing diagnostic accuracy.
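The clustering step mentioned in the abstract can be illustrated with a minimal sketch of k‐means (Lloyd's algorithm) followed by a cluster purity score against known tissue labels. This is not the authors' implementation: the expression values, the labels "A" and "B", and the function names `kmeans` and `purity` are hypothetical, and initializing the centroids from the first k samples is a simplification chosen to keep the example deterministic.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids).

    Assumption: centroids start at the first k points (a simplification;
    practical implementations usually use random or k-means++ seeding).
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, centroids


def purity(labels, truth):
    """Fraction of samples that belong to their cluster's majority class."""
    matched = 0
    for cluster in set(labels):
        members = [t for lab, t in zip(labels, truth) if lab == cluster]
        matched += Counter(members).most_common(1)[0][1]
    return matched / len(labels)


# Hypothetical toy data: three "genes" measured in six tissue samples.
points = [
    [1.0, 1.2, 0.9],
    [5.0, 5.2, 4.9],
    [0.8, 1.0, 1.1],
    [4.8, 5.1, 5.0],
    [1.1, 0.9, 1.0],
    [5.1, 4.9, 5.2],
]
tissue_types = ["A", "B", "A", "B", "A", "B"]

labels, centroids = kmeans(points, k=2)
score = purity(labels, tissue_types)
print(labels, score)
```

Under the same assumptions, a comparison of gene subsets would run this procedure once per subset and compare the resulting purity scores; whether a difference between subsets is meaningful would still require a statistical test.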

Keywords

clustering / k‐means / L1000 dataset analysis / landmark genes / multinomial logistic regression / non‐landmark genes / principal component analysis / tissue classification

Cite this article

Madhuri Gollapalli, Harsh Anand, Satish Mahadevan Srinivasan. Characterizing diseases using genetic and clinical variables: A data analytics approach. Quant. Biol., 2024, 12(3): 271-285. DOI: 10.1002/qub2.46

RIGHTS & PERMISSIONS

2024 The Authors. Quantitative Biology published by John Wiley & Sons Australia, Ltd on behalf of Higher Education Press.
