
An effective encoding of human medical conditions in disease space provides a versatile framework for deciphering disease associations
Tianxin Xu, Yu Li, Xin Gao, Andrey Rzhetsky, Gengjie Jia
Quant. Biol. 2025, Vol. 13, Issue 3: e93.
Identifying comorbidity patterns and mechanistically investigating disease associations is challenging because health-related data are often sparse, large-scale, and multimodal. From a systems biology perspective, embedding-based algorithms offer a way to examine diseases under a unified framework by mapping each disease to an embedding vector in a high-dimensional space. These vectors, and the disease space they constitute, encode pathological information and enable a quantitative, systematic measurement of the similarity between any pair of diseases, which opens an avenue for many types of downstream analyses. Here, we illustrate this potential through applications in discovering hidden disease associations, assisting genetic parameter estimation, facilitating data-driven disease classification, and transforming genetic association studies of diseases by accounting for comorbidities. While underscoring the power and versatility of this approach, we also discuss the challenges posed by medical context, the requirements of online training and result validation, and research opportunities in constructing foundation models from multimodal disease data. With continued innovation and exploration, disease embedding has the potential to transform disease association analysis, and even pathology studies, by providing a holistic representation of patient health status.
Keywords: biomedical data mining; disease embedding; machine learning
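To make the idea of measuring similarity between any pair of diseases concrete, the following is a minimal sketch, not the authors' implementation. It assumes cosine similarity as the measure (the abstract does not name one), and the disease names and 4-dimensional vectors in `toy_embeddings` are hypothetical values made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: real disease vectors would be learned from data
# and would typically have many more dimensions.
toy_embeddings = {
    "disease_A": [0.9, 0.1, 0.3, 0.0],
    "disease_B": [0.8, 0.2, 0.4, 0.1],
    "disease_C": [0.0, 0.9, 0.1, 0.7],
}


def most_similar(name, embeddings):
    """Rank every other disease by cosine similarity to `name`, highest first."""
    query = embeddings[name]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("disease_A", toy_embeddings)
for other, score in ranking:
    print(f"{other}: {score:.3f}")
```

With these toy values, `disease_B` ranks above `disease_C` as a neighbor of `disease_A`. Cosine similarity ignores vector length and compares direction only, which is one common choice for embedding spaces; other distance measures could be substituted without changing the overall pattern.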