Text Mining Approaches for Protein Function Annotation: Challenges and Opportunities
Wang Hong, Zhang Chengxin
Synth. Biol. Eng. 2026, Vol. 4, Issue 1: 10022
Understanding protein functions is essential for advancing quantitative synthetic biology, which applies quantitative and systems approaches to understand how biological functions emerge from building blocks, thereby guiding the rational design of complex living systems. Apart from a few model organisms, most species contain many proteins with unverified functions, highlighting the need for accurate, automated protein function annotation methods. Recent advances in protein bioinformatics, particularly in predicting structures and functions, have been driven by artificial intelligence (AI), especially deep learning models. Top-performing methods in the Critical Assessment of Function Annotation (CAFA) challenge have leveraged large language models to perform text mining-based protein function prediction, either by extracting features from scientific literature or by using template proteins with similar descriptions in the literature. Despite these advances, several challenges remain. Current predictors often depend on PubMed abstracts curated by UniProt, which makes their predictions redundant with manual annotations and causes them to overlook uncurated or full-text literature that contains richer functional evidence. Few systems automatically classify literature types or assess their relevance, limiting precision and interpretability. Benchmarking remains difficult because unbiased gold standards are lacking, making it hard to evaluate true predictive capability. Furthermore, integrating heterogeneous evidence (from text, sequences, and structural or network data) presents additional challenges for model harmonization. This review not only summarizes current methods and limitations but also highlights strategies to improve text mining-based protein function annotation using recent AI developments. Overall, this work aims to guide the development of next-generation tools for more accurate and comprehensive protein function predictions.
Proteins / Biological functions / Text mining / Gene Ontology (GO) terms / Deep learning