Text Mining Approaches for Protein Function Annotation: Challenges and Opportunities

Wang Hong, Zhang Chengxin

Synth. Biol. Eng. 2026, Vol. 4, Issue (1): 10022. DOI: 10.70322/sbe.2025.10022

Abstract

Understanding protein functions is essential for advancing quantitative synthetic biology, which applies quantitative and systems approaches to understand how biological functions emerge from building blocks, thereby guiding the rational design of complex living systems. Apart from a few model organisms, most species contain many proteins with unverified functions, highlighting the need for accurate, automated protein function annotation methods. Recent advances in protein bioinformatics, particularly in predicting structures and functions, have been driven by artificial intelligence (AI), especially deep learning models. Top-performing methods in the Critical Assessment of Function Annotation (CAFA) challenge have leveraged large language models to perform text mining-based protein function prediction, either by extracting features from scientific literature or by using template proteins with similar descriptions in the literature. Despite these advances, several challenges remain. Current predictors often depend on PubMed abstracts curated by UniProt, which makes their output redundant with manual annotations and causes them to overlook uncurated or full-text literature that contains richer functional evidence. Few systems automatically classify literature types or assess their relevance, limiting precision and interpretability. Benchmarking remains difficult because unbiased gold standards are absent, making it hard to evaluate true predictive capability. Furthermore, integrating heterogeneous evidence from text, sequences, and structural or network data presents additional challenges for model harmonization. This review summarizes current methods and their limitations and highlights strategies to improve text mining-based protein function annotation using recent AI developments. Overall, this work aims to guide the development of next-generation tools for more accurate and comprehensive protein function prediction.

Keywords

Proteins / Biological functions / Text mining / Gene Ontology (GO) terms / Deep learning

Cite this article

Wang Hong, Zhang Chengxin. Text Mining Approaches for Protein Function Annotation: Challenges and Opportunities. Synth. Biol. Eng., 2026, 4(1): 10022. DOI: 10.70322/sbe.2025.10022

Statement of the Use of Generative AI and AI-Assisted Technologies in the Writing Process

During the preparation of this manuscript, the author(s) used ChatGPT to improve the grammar, syntax, and organization of the main text. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the published article.

Acknowledgments

The authors thank Xiaoqiong Wei for insightful discussions. With the editor’s permission, this review is partly based on a Chinese-language review by the author previously published in the Synthetic Biology Journal [83]. The current review is not a mere translation of the Chinese version: it includes new discussion of challenges in text mining-based function prediction, and all figures have been redrawn to reflect the new content.

Author Contributions

Conceptualization, C.Z.; Writing—Original Draft Preparation, H.W.; Writing—Review & Editing, C.Z.; Visualization, H.W. and C.Z.; Supervision, C.Z.; Funding Acquisition, C.Z.

Ethics Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code to generate Figure 2 is available at https://github.com/kad-ecoli/uniprot_figure, accessed on 28 September 2025.

Funding

This work was supported by the National Key Research and Development Program of China (2025YFA0923600 to CZ).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1]

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: Tool for the unification of biology. Nat. Genet. 2000, 25, 25-29. doi:10.1038/75556.

[2]

International Union of Biochemistry. Enzyme Nomenclature, 1978: Recommendations of the Nomenclature Committee of the International Union of Biochemistry on the Nomenclature and Classification of Enzymes; Academic Press: Cambridge, MA, USA, 1979.

[3]

Talapova P, Gargano M, Matentzoglu N, Coleman B, Addo-Lartey E, Anagnostopoulos A, et al. The Human Phenotype Ontology in 2024: Phenotypes around the world. Nucleic Acids Res. 2024, 52, D1333-D1346. doi:10.1093/nar/gkad1005.

[4]

The UniProt Consortium. UniProt: The universal protein knowledgebase in 2025. Nucleic Acids Res. 2025, 53, D609-D617. doi:10.1093/nar/gkae1010.

[5]

Huntley RP, Sawford T, Mutowo-Meullenet P, Shypitsyna A, Bonilla C, Martin MJ, et al. The GOA database: Gene ontology annotation updates for 2015. Nucleic Acids Res. 2015, 43, D1057-D1063. doi:10.1093/nar/gku1113.

[6]

Feldmann P, Eicher EN, Leevers SJ, Hafen E, Hughes DA. Control of growth and differentiation by Drosophila RasGAP, a homolog of p120 ras-GTPase-activating protein. Mol. Cell Biol. 1999, 19, 1928-1937. doi:10.1128/MCB.19.3.1928.

[7]

Hutchison CA III, Chuang R-Y, Noskov VN, Assad-Garcia N, Deerinck TJ, Ellisman MH, et al. Design and synthesis of a minimal bacterial genome. Science 2016, 351, aad6253. doi:10.1126/science.aad6253.

[8]

Gaudet P, Livstone MS, Lewis SE, Thomas PD. Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Brief. Bioinform. 2011, 12, 449-462. doi:10.1093/bib/bbr042.

[9]

Wei X, Zhang C, Freddolino L, Zhang Y. Detecting Gene Ontology misannotations using taxon-specific rate ratio comparisons. Bioinformatics 2020, 36, 4383-4388. doi:10.1093/bioinformatics/btaa548.

[10]

Martin DM, Berriman M, Barton GJ. GOtcha: A new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinform. 2004, 5, 178. doi:10.1186/1471-2105-5-178.

[11]

Conesa A, Götz S. Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int. J. Plant Genom. 2008, 2008, 619832. doi:10.1155/2008/619832.

[12]

Piovesan D, Luigi Martelli P, Fariselli P, Zauli A, Rossi I, Casadio R. BAR-PLUS: The Bologna Annotation Resource Plus for functional and structural annotation of protein sequences. Nucleic Acids Res. 2011, 39, W197-W202. doi:10.1093/nar/gkr292.

[13]

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389-3402. doi:10.1093/nar/25.17.3389.

[14]

Wass MN, Sternberg MJ. ConFunc—Functional annotation in the twilight zone. Bioinformatics 2008, 24, 798-806. doi:10.1093/bioinformatics/btn037.

[15]

Hawkins T, Chitale M, Luban S, Kihara D. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins 2009, 74, 566-582. doi:10.1002/prot.22172.

[16]

Gong Q, Ning W, Tian W. GoFDR: A sequence alignment based method for predicting protein functions. Methods 2016, 93, 3-14. doi:10.1016/j.ymeth.2015.08.009.

[17]

Mahlich Y, Steinegger M, Rost B, Bromberg Y. HFSP: High speed homology-driven function annotation of proteins. Bioinformatics 2018, 34, i304-i312. doi:10.1093/bioinformatics/bty262.

[18]

Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017, 35, 1026-1028. doi:10.1038/nbt.3988.

[19]

Kulmanov M, Hoehndorf R. DeepGOPlus: Improved protein function prediction from sequence. Bioinformatics 2020, 36, 422-429. doi:10.1093/bioinformatics/btz595.

[20]

Kulmanov M, Hoehndorf R. DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms. Bioinformatics 2022, 38, i238-i245. doi:10.1093/bioinformatics/btac256.

[21]

Yuan Q, Xie J, Xie J, Zhao H, Yang Y. Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion. Brief. Bioinform. 2023, 24, bbad117. doi:10.1093/bib/bbad117.

[22]

Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 2021, 18, 366-368. doi:10.1038/s41592-021-01101-x.

[23]

Zhang C, Freddolino L. A large-scale assessment of sequence database search tools for homology-based protein function prediction. Brief. Bioinform. 2024, 25, bbae349. doi:10.1093/bib/bbae349.

[24]

Zhang C, Freddolino L, Zhang Y. COFACTOR: Improved protein function prediction by combining structure, sequence and protein-protein interaction information. Nucleic Acids Res. 2017, 45, W291-W299. doi:10.1093/nar/gkx366.

[25]

Zhang C, Zheng W, Freddolino L, Zhang Y. MetaGO: Predicting Gene Ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping. J. Mol. Biol. 2018, 430, 2256-2265. doi:10.1016/j.jmb.2018.03.004.

[26]

Zhang Y, Skolnick J. TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005, 33, 2302-2309. doi:10.1093/nar/gki524.

[27]

Zhang C, Zhang X, Freddolino L, Zhang Y. BioLiP2: An updated structure database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2024, 52, D404-D412. doi:10.1093/nar/gkad630.

[28]

Laskowski RA. The ProFunc function prediction server. In Protein Function Prediction: Methods and Protocols; Springer: New York, NY, USA, 2017; pp. 75-95.

[29]

Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr. D Biol. Crystallogr. 2004, 60, 2256-2268. doi:10.1107/S0907444904026460.

[30]

Barker JA, Thornton JM. An algorithm for constraint-based structural template matching: Application to 3D templates with statistical analysis. Bioinformatics 2003, 19, 1644-1649. doi:10.1093/bioinformatics/btg226.

[31]

Zhang C, Liu Q, Freddolino L. StarFunc: Fusing template-based and deep learning approaches for accurate protein function prediction. bioRxiv 2024. doi:10.1101/2024.05.15.594113.

[32]

Van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CL, et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 2024, 42, 243-246. doi:10.1038/s41587-023-01773-0.

[33]

Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022, 50, D439-D444. doi:10.1093/nar/gkab1061.

[34]

Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer EL, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021, 49, D412-D419. doi:10.1093/nar/gkaa913.

[35]

Liu Q, Zhang C, Freddolino L. InterLabelGO+: Unraveling label correlations in protein function prediction. Bioinformatics 2024, 40, btae655. doi:10.1093/bioinformatics/btae655.

[36]

Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 2021, 12, 3168. doi:10.1038/s41467-021-23303-9.

[37]

Ma W, Zhang S, Li Z, Jiang M, Wang S, Lu W, et al. Enhancing protein function prediction performance by utilizing AlphaFold-predicted protein structures. J. Chem. Inf. Model. 2022, 62, 4008-4017. doi:10.1021/acs.jcim.2c00885.

[38]

Qiu X-Y, Wu H, Shao J. TALE-cmap: Protein function prediction based on a TALE-based architecture and the structure information from contact map. Comput. Biol. Med. 2022, 149, 105938. doi:10.1016/j.compbiomed.2022.105938.

[39]

Yang Y, Jerger A, Feng S, Wang Z, Brasfield C, Cheung MS, et al. Improved enzyme functional annotation prediction using contrastive learning with structural inference. Commun. Biol. 2024, 7, 1690. doi:10.1038/s42003-024-07359-z.

[40]

Lan L, Djuric N, Guo Y, Vucetic S. MS-kNN: Protein function prediction by integrating multiple data sources. BMC Bioinform. 2013, 14, S8. doi:10.1186/1471-2105-14-S3-S8.

[41]

Piovesan D, Tosatto SC. INGA 2.0: Improving protein function prediction for the dark proteome. Nucleic Acids Res. 2019, 47, W373-W378. doi:10.1093/nar/gkz375.

[42]

You R, Zhang Z, Xiong Y, Sun F, Mamitsuka H, Zhu S. GOLabeler: Improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 2018, 34, 2465-2473. doi:10.1093/bioinformatics/bty130.

[43]

Blum M, Chang H-Y, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2021, 49, D344-D354. doi:10.1093/nar/gkaa977.

[44]

Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13-17 August 2016.

[45]

You R, Yao S, Xiong Y, Huang X, Sun F, Mamitsuka H, et al. NetGO: Improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 2019, 47, W379-W387. doi:10.1093/nar/gkz388.

[46]

Yao S, You R, Wang S, Xiong Y, Huang X, Zhu S. NetGO 2.0: Improving large-scale protein function prediction with massive sequence, text, domain, family and network information. Nucleic Acids Res. 2021, 49, W469-W475. doi:10.1093/nar/gkab398.

[47]

Kulmanov M, Khan MA, Hoehndorf R. DeepGO: Predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 2018, 34, 660-668. doi:10.1093/bioinformatics/btx624.

[48]

Sanderson T, Bileschi ML, Belanger D, Colwell LJ. ProteInfer, deep neural networks for protein functional inference. eLife 2023, 12, e80942. doi:10.7554/eLife.80942.

[49]

Ryu JY, Kim HU, Lee SY. Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers. Proc. Natl. Acad. Sci. USA 2019, 116, 13996-14001. doi:10.1073/pnas.1821905116.

[50]

Han S-R, Park M, Kosaraju S, Lee J, Lee H, Lee JH, et al. Evidential deep learning for trustworthy prediction of enzyme commission number. Brief. Bioinform. 2024, 25, bbad401. doi:10.1093/bib/bbad401.

[51]

Zhu Y-H, Zhang C, Yu D-J, Zhang Y. Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction. PLoS Comput. Biol. 2022, 18, e1010793. doi:10.1371/journal.pcbi.1010793.

[52]

Kulmanov M, Guzmán-Vega FJ, Roggli PD, Lane L, Arold ST, Hoehndorf R. DeepGO-SE: Protein function prediction as approximate semantic entailment. bioRxiv 2023. doi:10.1101/2023.09.26.559473.

[53]

Chervov A, Vakhrushev A, Fironov S, Martignetti L. ProtBoost: Protein function prediction with Py-Boost and Graph Neural Networks—CAFA5 top2 solution. arXiv 2024, arXiv:2412.04529.

[54]

Wang W, Shuai Y, Zeng M, Fan W, Li M. DPFunc: Accurately predicting protein function via deep learning with domain-guided structure information. Nat. Commun. 2025, 16, 70. doi:10.1038/s41467-024-54816-8.

[55]

Kim GB, Kim JY, Lee JA, Norsigian CJ, Palsson BO, Lee SY. Functional annotation of enzyme-encoding genes using deep learning with transformer layers. Nat. Commun. 2023, 14, 7370. doi:10.1038/s41467-023-43216-z.

[56]

Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science 2023, 379, 1358-1363. doi:10.1126/science.adf2465.

[57]

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. doi:10.48550/arXiv.1706.03762.

[58]

Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123-1130. doi:10.1126/science.ade2574.

[59]

Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7112-7127. doi:10.1109/TPAMI.2021.3095381.

[60]

Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 2013, 10, 221-227. doi:10.1038/nmeth.2340.

[61]

Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 2019, 20, 244. doi:10.1186/s13059-019-1835-8.

[62]

Jiang Y, Oron TR, Clark WT, Bankapur AR, D’Andrea D, Lepore R, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016, 17, 184. doi:10.1186/s13059-016-1037-6.

[63]

Yan H, Wang S, Liu H, Mamitsuka H, Zhu S. GORetriever: Reranking protein-description-based GO candidates by literature-driven deep information retrieval for protein function annotation. Bioinformatics 2024, 40, ii53-ii61. doi:10.1093/bioinformatics/btae401.

[64]

Chua ZM, Rajesh A, Sinha S, Adams PD. PROTGOAT: Improved automated protein function predictions using Protein Language Models. bioRxiv 2024. doi:10.1101/2024.04.01.587572.

[65]

Cozzetto D, Buchan DW, Bryson K, Jones DT. Protein function prediction by massive integration of evolutionary analyses and multiple data sources. BMC Bioinform. 2013, 14, S1. doi:10.1186/1471-2105-14-S3-S1.

[66]

You R, Huang X, Zhu S. DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation. Methods 2018, 145, 82-90. doi:10.1016/j.ymeth.2018.05.026.

[67]

Le Q, Mikolov T. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21-26 June 2014.

[68]

Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 2021, 3, 1-23. doi:10.1145/3458754.

[69]

Cohan A, Feldman S, Beltagy I, Downey D, Weld DS. SPECTER: Document-level representation learning using citation-informed transformers. arXiv 2020, arXiv:2004.07180. doi:10.48550/arXiv.2004.07180.

[70]

Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv 2019, arXiv:1908.10084. doi:10.48550/arXiv.1908.10084.

[71]

Wu J, Yin Q, Zhang C, Geng J, Wu H, Hu H, et al. Function Prediction for G Protein-Coupled Receptors through Text Mining and Induction Matrix Completion. ACS Omega 2019, 4, 3045-3054. doi:10.1021/acsomega.8b02454.

[72]

Badal VD, Kundrotas PJ, Vakser IA. Text mining for protein docking. PLoS Comput. Biol. 2015, 11, e1004630. doi:10.1371/journal.pcbi.1004630.

[73]

Kafkas Ş, Hoehndorf R. Ontology based text mining of gene-phenotype associations: Application to candidate gene prediction. Database 2019, 2019, baz019. doi:10.1093/database/baz019.

[74]

Czarnecki J, Nobeli I, Smith AM, Shepherd AJ. A text-mining system for extracting metabolic reactions from full-text articles. BMC Bioinform. 2012, 13, 172. doi:10.1186/1471-2105-13-172.

[75]

Verspoor KM, Cohn JD, Ravikumar KE, Wall ME. Text mining improves prediction of protein functional sites. PLoS ONE 2012, 7, e32171. doi:10.1371/journal.pone.0032171.

[76]

Wei X, Zou S, Xie Z, Wang Z, Huang N, Cen Z, et al. EDIL3 deficiency ameliorates adverse cardiac remodelling by neutrophil extracellular traps (NET)-mediated macrophage polarization. Cardiovasc. Res. 2022, 118, 2179-2195. doi:10.1093/cvr/cvab269.

[77]

Pafilis E, Buttigieg PL, Ferrell B, Pereira E, Schnetzer J, Arvanitidis C, et al. EXTRACT: Interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. Database 2016, 2016, baw005. doi:10.1093/database/baw005.

[78]

Wei C-H, Kao H-Y, Lu Z. PubTator: A web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013, 41, W518-W522. doi:10.1093/nar/gkt441.

[79]

Weber L, Sänger M, Münchmeyer J, Habibi M, Leser U, Akbik A. HunFlair: An easy-to-use tool for state-of-the-art biomedical named entity recognition. Bioinformatics 2021, 37, 2792-2794. doi:10.1093/bioinformatics/btab042.

[80]

Giorgi JM, Bader GD. Towards reliable named entity recognition in the biomedical domain. Bioinformatics 2020, 36, 280-286. doi:10.1093/bioinformatics/btz504.

[81]

Furrer L, Jancso A, Colic N, Rinaldi F. OGER++: Hybrid multi-type entity recognition. J. Cheminform. 2019, 11, 7. doi:10.1186/s13321-018-0326-3.

[82]

Zhu S, Cai J, Xiong R, Zheng L, Ma D. Singular pooling: A spectral pooling paradigm for second-trimester prenatal level II ultrasound standard fetal plane identification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 12508-12523. doi:10.1109/TCSVT.2025.3588395.

[83]

Zhang C. Challenges and opportunities in text mining-based protein function annotation. Synth. Biol. J. 2025, 6, 603-606. doi:10.12211/2096-8280.2025-002.
