A survey on memory-efficient transformer-based model training in AI for science
Kaiyuan TIAN, Linbo QIAO, Baihui LIU, Gongqingjian JIANG, Shanshan LI, Dongsheng LI
Front. Comput. Sci., 2026, Vol. 20, Issue 11: 2011355
Traditional research methods often make scientific discovery costly and inefficient, while the rise of deep learning and large language models (LLMs) offers new solutions. This survey reviews transformer-based LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the continual growth of model size imposes substantial memory demands during training, hindering the further development and application of LLMs for science. This survey systematically reviews and categorizes memory-efficient pre-training techniques for large-scale transformers, spanning algorithm-level optimization, system-level optimization, and hardware-software co-optimization. Taking AlphaFold 2 as an example, we show how tailored memory optimization methods reduce memory requirements while preserving prediction accuracy. By bridging model efficiency and the needs of scientific applications, we aim to provide insights for scalable and cost-effective LLM training in AI for science.
AI for science / memory optimization / large language model / distributed training
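To make the algorithm-level category concrete, the sketch below (not taken from the paper; the module names, dimensions, and depth are hypothetical) illustrates one widely used memory-efficient training technique of the kind the survey covers, activation (gradient) checkpointing via PyTorch's torch.utils.checkpoint: intermediate activations are discarded during the forward pass and recomputed during backward, trading extra compute for a large reduction in activation memory.

# Minimal, hypothetical sketch of activation (gradient) checkpointing,
# one algorithm-level memory optimization for transformer training.
# Block definition, dimensions, and depth are illustrative only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyTransformerBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))

class CheckpointedStack(nn.Module):
    def __init__(self, depth=12, dim=512):
        super().__init__()
        self.blocks = nn.ModuleList(TinyTransformerBlock(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # Drop each block's intermediate activations in the forward pass
            # and recompute them during backward, trading compute for memory.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

model = CheckpointedStack()
x = torch.randn(2, 128, 512, requires_grad=True)
model(x).sum().backward()  # activations are recomputed block by block here

With checkpointing enabled, activation memory grows with the number of checkpointed segments rather than with every intermediate tensor in the network, at the cost of roughly one extra forward pass per backward step.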