Foundation models in molecular biology

Yunda Si, Jiawei Zou, Yicheng Gao, Guohui Chuai, Qi Liu, Luonan Chen

Biophysics Reports ›› 2024, Vol. 10 ›› Issue (3) : 135-151.

DOI: 10.52601/bpr.2024.240006
REVIEW

Abstract

Determining correlations between molecules at various levels is a central topic in molecular biology. In natural language processing and image generation, large language models have demonstrated a remarkable ability to capture correlations from large amounts of data. Because the correlations they capture can be applied to a wide range of specific tasks, such models are also referred to as foundation models. The massive amount of data available in molecular biology provides an excellent basis for developing foundation models, and their recent emergence has substantially advanced the field. We summarize the foundation models developed on RNA sequence data, DNA sequence data, protein sequence data, single-cell transcriptome data, and spatial transcriptome data, respectively, and further discuss research directions for the development of foundation models in molecular biology.

Keywords

Foundation models / Molecular biology / Transcriptome

Cite this article

Yunda Si, Jiawei Zou, Yicheng Gao, Guohui Chuai, Qi Liu, Luonan Chen. Foundation models in molecular biology. Biophysics Reports, 2024, 10(3): 135‒151 https://doi.org/10.52601/bpr.2024.240006

CrossRef Google scholar
[137]
Zhou Z, Ji Y, Li W, Dutta P, Davuluri R, Liu H (2023b) DNABERT-2: efficient foundation model and benchmark for multi-species genome. arXiv. https://doi.org/10.48550/arXiv.2306.15006
[138]
Zhu J , Fan Y , Xiong Y , Wang W , Chen J , Xia Y , Lei J , Gong L , Sun S , Jiang T . Delineating the dynamic evolution from preneoplasia to invasive lung adenocarcinoma by integrating single-cell rna sequencing and spatial transcriptomics. Exp Mol Med, 2022, 54(11): 2060–2076
CrossRef Google scholar
[139]
Zuo C , Zhang Y , Cao C , Feng J , Jiao M , Chen L . Elucidating tumor heterogeneity from spatially resolved transcriptomics data by multi-view graph collaborative learning. Nat Commun, 2022, 13(1): 5962
CrossRef Google scholar

Compliance with ethics guidelines

Conflict of interest: Yunda Si, Jiawei Zou, Yicheng Gao, Guohui Chuai, Qi Liu and Luonan Chen declare that they have no conflict of interest.

Human and animal rights and informed consent: This article does not contain any studies with human or animal subjects performed by any of the authors.

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

RIGHTS & PERMISSIONS

The Author(s) 2024. Published by Higher Education Press. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0).