EnrichGT: a comprehensive R-based tool for functional genomics enrichment analysis based on large language models

Runchen Wang, Zhiming Ye, Qixia Wang, Bo Liang, Nanfei Fu, Wenxi Wang, Huimin Deng, Taimin Zhu, Shangxi Zeng, Yudong Zhang, Shunjun Jiang, Ying Huang, Wenhua Liang, Hengrui Liang, Jianxing He, Xusen Zou

Artificial Intelligence Surgery, 2026, Vol. 6, Issue 1: 18-35.

DOI: 10.20517/ais.2025.67
Original Article

Abstract

Aim: We aimed to develop EnrichGT, an open-source and clinician-friendly R package for functional genomics enrichment analysis leveraging large language models (LLMs). The tool addresses major limitations of existing approaches, including semantic redundancy, limited interpretability, and static reporting frameworks, thereby facilitating clinical interpretation and supporting data-driven decision-making.

Methods: EnrichGT implements both over-representation analysis and preranked gene set enrichment analysis using multiple knowledge bases. To minimize redundancy, enriched pathways are clustered based on shared genes, emphasizing coherent biological themes. Biological interpretability is further improved by inferring transcription factor activity through CollecTRI (Collection of Transcription Regulation Interactions, https://github.com/saezlab/CollecTRI) and pathway activity via PROGENy (Pathway RespOnsive GENes for activity inference, https://saezlab.github.io/progeny/). Additionally, context-aware annotations are generated through LLM integration, and results are compiled into dynamic, interactive reports using Quarto.
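The over-representation step can be illustrated with a minimal sketch. It assumes the standard one-sided hypergeometric test followed by Benjamini-Hochberg adjustment; the abstract does not state which test or correction EnrichGT uses internally, and the function names below (`hypergeom_upper_tail`, `benjamini_hochberg`, `over_representation`) are hypothetical, not part of the package.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric: population M, n annotated genes, N drawn."""
    total = comb(M, N)
    upper = min(n, N)
    # comb() returns 0 when the second argument exceeds the first,
    # so impossible configurations contribute nothing to the sum.
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def over_representation(query, gene_sets, universe):
    """Return {term: (p_value, adjusted_p)} for every term in gene_sets."""
    universe = set(universe)
    query = set(query) & universe
    names = sorted(gene_sets)
    pvals = []
    for name in names:
        members = set(gene_sets[name]) & universe
        overlap = len(query & members)
        pvals.append(
            hypergeom_upper_tail(overlap, len(universe), len(members), len(query))
        )
    padj = benjamini_hochberg(pvals)
    return {name: (p, q) for name, p, q in zip(names, pvals, padj)}


# Toy example (assumed data): 20 background genes, two 5-gene terms.
universe = [f"g{i}" for i in range(20)]
gene_sets = {"A": universe[0:5], "B": universe[10:15]}
result = over_representation(universe[0:4], gene_sets, universe)
```

In this toy input all four query genes fall in term A and none in term B, so A receives a small p-value and B a p-value of 1.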

Results: EnrichGT streamlines functional genomics enrichment analysis by clustering pathways based on gene co-occurrence, reducing redundancy and enhancing interpretability. When EnrichGT was applied to lung adenocarcinoma data from The Cancer Genome Atlas (TCGA), 873 enriched Gene Ontology terms were consolidated into 15 biologically coherent modules, revealing key processes such as myeloid cell activation and tumor-associated angiogenesis. Downstream analysis identified major tumor-associated regulators [CREB1 (cAMP responsive element binding protein 1), RELA (RELA proto-oncogene, the p65 subunit of nuclear factor kappa-light-chain-enhancer of activated B cells, NF-κB), HIF1A (hypoxia inducible factor 1 subunit alpha), PPARG (peroxisome proliferator activated receptor gamma), and ETS1 (ETS proto-oncogene 1)] and critical signaling axes [tumor necrosis factor alpha (TNFα), NF-κB, and hypoxia (oxygen deprivation-related signaling)]. Automated LLM-based annotations and multi-database integration provided complementary pathway insights. Furthermore, EnrichGT's comparative multi-condition framework revealed conserved and condition-specific biological patterns across datasets, including single-cell ear-canal development and TCGA tumor-stage progression. Its dynamic reporting interface supported transparent, reproducible, and iterative exploration of enrichment results.
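How many enriched terms can collapse into a few modules based on shared genes can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure or clustering algorithm, so the sketch assumes Jaccard similarity with single-linkage grouping at a fixed threshold, and `jaccard`, `cluster_terms`, and the toy gene lists are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_terms(term_genes, threshold=0.5):
    """Group terms whose gene overlap reaches the threshold (single linkage)."""
    names = sorted(term_genes)
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if jaccard(term_genes[a], term_genes[b]) >= threshold:
            parent[find(a)] = find(b)

    modules = {}
    for name in names:
        modules.setdefault(find(name), []).append(name)
    return sorted(modules.values())


# Toy input (assumed): four enriched terms sharing genes in two groups.
terms = {
    "angiogenesis": ["VEGFA", "KDR", "FLT1"],
    "blood vessel development": ["VEGFA", "KDR", "FLT1", "TEK"],
    "myeloid cell activation": ["ITGAM", "CD14"],
    "leukocyte activation": ["ITGAM", "CD14", "PTPRC"],
}
modules = cluster_terms(terms, threshold=0.5)
```

With this toy input the four terms reduce to two modules, one vascular and one immune. The threshold controls how aggressively terms are merged, so a real analysis would need to tune it.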

Conclusion: EnrichGT offers a robust, clinician-friendly solution for functional genomics enrichment analysis, enhancing clinical interpretation and data-driven decision-making.

Keywords

Enrichment analysis / large language models / visualization / EnrichGT

Cite this article

Runchen Wang, Zhiming Ye, Qixia Wang, Bo Liang, Nanfei Fu, Wenxi Wang, Huimin Deng, Taimin Zhu, Shangxi Zeng, Yudong Zhang, Shunjun Jiang, Ying Huang, Wenhua Liang, Hengrui Liang, Jianxing He, Xusen Zou. EnrichGT: a comprehensive R-based tool for functional genomics enrichment analysis based on large language models. Artificial Intelligence Surgery, 2026, 6(1): 18-35. DOI: 10.20517/ais.2025.67
