An effective encoding of human medical conditions in disease space provides a versatile framework for deciphering disease associations

Tianxin Xu, Yu Li, Xin Gao, Andrey Rzhetsky, Gengjie Jia

Quant. Biol. 2025, Vol. 13, Issue 3: e93. DOI: 10.1002/qub2.93
PERSPECTIVE

Abstract

Identifying comorbidity patterns and mechanistically investigating disease associations is challenging because health-related data are often sparse, large-scale, and multimodal. Following a systems biology approach, embedding-based algorithms offer a new way to examine diseases under a unified framework by mapping them into a high-dimensional space as embedding vectors. These vectors, and the disease space they constitute, encode pathological information and enable a quantitative and systemic measurement of the similarity between any pair of diseases, opening an avenue for many types of downstream analyses. Here, we illustrate this potential through applications in discovering hidden disease associations, assisting genetic parameter estimation, facilitating data-driven disease classification, and transforming genetic association studies of diseases by taking comorbidities into account. While underscoring the power and versatility of this approach, we also discuss the challenges posed by medical context, the requirements of online training and result validation, and research opportunities in constructing foundation models from multimodal disease data. With continued innovation and exploration, disease embedding has the potential to transform disease association analysis, and even pathology studies, by providing a holistic representation of patient health status.
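As a rough illustration of what measuring similarity between any pair of diseases can look like in practice, the sketch below ranks diseases by cosine similarity between their embedding vectors. Cosine similarity is only one common choice and is assumed here; the abstract does not specify a metric. The disease names, the four-dimensional vectors, and the helper functions `cosine_similarity` and `most_similar` are hypothetical and exist purely for illustration.

```python
import math

# Hypothetical toy embeddings. Real disease vectors would be learned from
# health data and would typically have far more dimensions.
toy_embeddings = {
    "asthma": [0.9, 0.1, 0.3, 0.0],
    "allergic_rhinitis": [0.8, 0.2, 0.4, 0.1],
    "hip_fracture": [0.0, 0.9, 0.1, 0.7],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(name, embeddings):
    """Rank every other disease by cosine similarity to `name`, highest first."""
    query = embeddings[name]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for disease, score in most_similar("asthma", toy_embeddings):
        print(f"{disease}: {score:.3f}")
```

With these made-up vectors, the two respiratory conditions rank as nearest neighbours. In a learned disease space, the same kind of ranking could serve as a starting point for the downstream analyses the abstract lists.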

Keywords

biomedical data mining / disease embedding / machine learning

Cite this article

Tianxin Xu, Yu Li, Xin Gao, Andrey Rzhetsky, Gengjie Jia. An effective encoding of human medical conditions in disease space provides a versatile framework for deciphering disease associations. Quant. Biol., 2025, 13(3): e93. DOI: 10.1002/qub2.93

RIGHTS & PERMISSIONS

The Author(s). Quantitative Biology published by John Wiley & Sons Australia, Ltd on behalf of Higher Education Press.
