Perspectives on benchmarking foundation models for network biology
Christina V. Theodoris
Transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging large-scale general datasets to pretrain models with foundational knowledge that can then be transferred to improve predictions across a vast range of downstream tasks. More recently, transfer learning approaches have been increasingly adopted in biological fields, where models are pretrained on massive amounts of biological data and employed to make predictions in a broad range of biological applications. However, unlike natural language, where humans are well suited to evaluate models given a clear understanding of the ground truth, biology presents the unique challenge of a setting rife with unknowns that must nonetheless abide by real-world physical constraints. This perspective discusses key points we should consider as a field in designing benchmarks for foundation models in network biology.
Keywords: benchmarking strategy / foundation models / network biology / transfer learning