A survey on memory-efficient transformer-based model training in AI for science

Kaiyuan TIAN, Linbo QIAO, Baihui LIU, Gongqingjian JIANG, Shanshan LI, Dongsheng LI

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (11): 2011355  DOI: 10.1007/s11704-025-50302-6
Artificial Intelligence
REVIEW ARTICLE


Abstract

Traditional scientific research methods are often costly and inefficient, while the rise of deep learning and large language models (LLMs) offers innovative solutions. This survey reviews transformer-based LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the continuous growth of model size has created substantial memory demands that hinder the further development and application of LLMs for science. This survey systematically reviews and categorizes memory-efficient pre-training techniques for large-scale transformers, covering algorithm-level optimization, system-level optimization, and hardware-software co-optimization. Taking AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods reduce memory requirements while preserving prediction accuracy. By bridging model efficiency and the needs of scientific applications, we hope to provide insights for scalable and cost-effective LLM training in AI for science.
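
As a minimal, illustrative sketch only (not code from the survey or from AlphaFold 2), the PyTorch snippet below applies two representative memory-saving measures, activation checkpointing and bfloat16 mixed precision, to a toy transformer block. All module names and sizes (ToyBlock, ToyModel, d_model=256, and so on) are hypothetical placeholders.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ToyBlock(nn.Module):
    """A minimal transformer encoder block, used only for demonstration."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

class ToyModel(nn.Module):
    def __init__(self, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([ToyBlock() for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            # Activation checkpointing: free each block's intermediate
            # activations after the forward pass and recompute them during
            # backward, trading extra compute for a smaller activation footprint.
            x = checkpoint(block, x, use_reentrant=False)
        return x

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = ToyModel().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    data = torch.randn(8, 128, 256, device=device)  # (batch, seq_len, d_model)

    # Mixed precision: autocast runs eligible ops in bfloat16 while parameters
    # and optimizer state remain in fp32.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(data).pow(2).mean()  # dummy objective for illustration
    loss.backward()
    opt.step()
    opt.zero_grad()

Checkpointing and reduced-precision arithmetic are typical of the algorithm-level techniques referred to above; system-level methods (for example, sharded data parallelism and offloading) and hardware-software co-optimization are not shown here.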

Keywords

AI for science / memory optimization / large language model / distributed training

Cite this article

Kaiyuan TIAN, Linbo QIAO, Baihui LIU, Gongqingjian JIANG, Shanshan LI, Dongsheng LI. A survey on memory-efficient transformer-based model training in AI for science. Front. Comput. Sci., 2026, 20(11): 2011355. DOI: 10.1007/s11704-025-50302-6



RIGHTS & PERMISSIONS

Higher Education Press
