Foundation models for bioinformatics

Ziyu Chen, Lin Wei, Ge Gao

Quant. Biol. 2024, Vol. 12, Issue (4): 339-344. DOI: 10.1002/qub2.69
PERSPECTIVE

Abstract

Transformer‐based foundation models such as ChatGPT have revolutionized our daily lives and affected many fields, including bioinformatics. In this perspective, we first discuss the direct application of textual foundation models to bioinformatics tasks, focusing on how to make the most of canonical large language models and how to mitigate their inherent flaws. We then review transformer‐based, bioinformatics‐tailored foundation models for both sequence and non‐sequence data. Finally, we envision future development directions and challenges for bioinformatics foundation models.

Keywords

ChatGPT / foundation models / large language models / transformer

Cite this article

Ziyu Chen, Lin Wei, Ge Gao. Foundation models for bioinformatics. Quant. Biol., 2024, 12(4): 339-344. DOI: 10.1002/qub2.69

RIGHTS & PERMISSIONS

2024 The Author(s). Quantitative Biology published by John Wiley & Sons Australia, Ltd on behalf of Higher Education Press.
