Protein engineering in the deep learning era

Bingxin Zhou, Yang Tan, Yutong Hu, Lirong Zheng, Bozitao Zhong, Liang Hong

mLife. 2024;3(4):477–491. DOI: 10.1002/mlf2.12157
REVIEW

Abstract

Advances in deep learning have significantly aided protein engineering in addressing challenges in industrial production, healthcare, and environmental sustainability. This review frames frequently researched problems in protein understanding and engineering from the perspective of deep learning. It provides a thorough discussion of representation methods for protein sequences and structures, along with general encoding pipelines that support both pre-training and supervised learning tasks. We summarize state-of-the-art protein language models, geometric deep learning techniques, and combinations of distinct approaches for learning from multi-modal biological data. Additionally, we outline common downstream tasks and relevant benchmark datasets for training and evaluating deep learning models, with a focus on the particular needs of protein engineering applications, such as identifying mutation sites and predicting properties for the virtual screening of candidates. This review offers biologists the latest tools for assisting their engineering projects, while providing computer scientists with a clear and comprehensive guide to developing more powerful solutions by standardizing problem formulation and consolidating data resources. We foresee a deeper integration of the biology and computer science communities in future research, unleashing the full potential of deep learning in protein engineering and driving new scientific breakthroughs.
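
As a concrete illustration of one downstream task named above, the sketch below scores a single amino acid substitution with a pre-trained protein language model using the masked-marginal heuristic (the log-likelihood ratio between mutant and wild-type residues at a masked position). This is a minimal, illustrative example rather than any pipeline proposed in the review: it assumes the Hugging Face transformers and PyTorch packages, the public ESM-2 checkpoint facebook/esm2_t6_8M_UR50D, and an arbitrary toy sequence; the mutation_score helper is defined here purely for demonstration.

```python
# Minimal sketch: zero-shot scoring of a point mutation with a protein language
# model via masked-marginal log-likelihoods. Illustrative only; the model choice,
# sequence, and helper name are assumptions, not the review's own method.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = EsmForMaskedLM.from_pretrained(MODEL_NAME).eval()


def mutation_score(sequence: str, position: int, wt: str, mut: str) -> float:
    """Return log P(mut) - log P(wt) at a masked `position` (0-based)."""
    assert sequence[position] == wt, "wild-type residue does not match the sequence"
    inputs = tokenizer(sequence, return_tensors="pt")
    # The ESM tokenizer prepends a <cls> token, hence the +1 offset.
    inputs["input_ids"][0, position + 1] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, position + 1], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()


# Example: score the substitution A4G in an arbitrary toy sequence.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(mutation_score(seq, position=3, wt="A", mut="G"))
```

Higher scores correspond to substitutions the language model considers more plausible in the given sequence context, which is commonly used as a zero-shot proxy for mutation effects when no labeled fitness data are available.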

Keywords

artificial intelligence / geometric deep learning / protein engineering / protein language model / synthetic biology

Cite this article

Bingxin Zhou, Yang Tan, Yutong Hu, Lirong Zheng, Bozitao Zhong, Liang Hong. Protein engineering in the deep learning era. mLife. 2024;3(4):477–491. DOI: 10.1002/mlf2.12157

RIGHTS & PERMISSIONS

© 2024 The Author(s). mLife published by John Wiley & Sons Australia, Ltd on behalf of Institute of Microbiology, Chinese Academy of Sciences.
