Abstract
The evolutionary scale modeling (ESM) series of protein language models promises to revolutionize protein science and engineering by applying large language models (LLMs) to proteins, providing a robust framework for understanding the relationships among protein sequences, structures, and functions. Trained on large numbers of unlabeled protein sequences, ESM models capture intricate patterns of mutation and conservation, yielding insights into the structural and functional properties of proteins. Despite a growing body of literature on ESM, existing surveys rarely describe its advancements and applications in a comprehensive, focused manner. This survey covers the latest developments in ESM, categorizing them into techniques for using ESM and downstream applications. Approximately 100 papers are selected and analyzed, highlighting recognized and innovative studies that exemplify the impact of ESM. Furthermore, we critically discuss the strengths and limitations of ESM to envision future applications. This review provides a valuable resource for researchers seeking to explore the power of ESM models and the emerging applications of LLMs in biology and medicine.
Keywords
BERT / fine-tuning / pretraining / prompting / protein design / protein function / protein language model / Transformer
Cite this article
Qingyu Yang, Jiale Yu, Jie Zheng.
A survey of downstream applications of evolutionary scale modeling protein language models.
Quant. Biol., 2026, 14(1): e70013. DOI: 10.1002/qub2.70013
RIGHTS & PERMISSIONS
The Author(s). Quantitative Biology published by John Wiley & Sons Australia, Ltd on behalf of Higher Education Press.