BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism

Runzhe CHEN, Guandong LU, Yakai WANG, Rui ZHANG, Zheng HU, Yanming MIAO, Zhifang CAI, Jingwen LENG, Minyi GUO

Front. Comput. Sci., 2025, Vol. 19, Issue 1: 191102. DOI: 10.1007/s11704-023-3401-5
Architecture
RESEARCH ARTICLE


Abstract

As deep neural networks (DNNs) have been successfully adopted in various domains, training these large-scale models has become increasingly difficult and is often deployed on compute clusters composed of many devices such as GPUs. However, as the size of the cluster grows, so does the probability of failures during training. Currently, faults are mainly handled by recording checkpoints and recovering from them, but this approach introduces large overhead and degrades training efficiency even when no error occurs: a low checkpointing frequency leads to a large loss of training progress upon failure, while a high checkpointing frequency hurts training throughput. To resolve this contradiction, we propose BAFT, a bubble-aware fault-tolerant framework for hybrid-parallel distributed training. BAFT automatically analyzes the parallel strategy, profiles runtime information, and schedules checkpointing tasks at the granularity of pipeline stages according to the bubble distribution in training. It achieves high checkpoint efficiency while introducing less than 1% time overhead, which allows checkpoints to be recorded at high frequency, thereby reducing the training time lost during error recovery and avoiding the impact of fault tolerance on training.
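To illustrate the scheduling idea described in the abstract, the Python sketch below shows one possible way to place per-stage checkpoint snapshots into profiled pipeline bubbles so that the exposed overhead stays within a budget. All names (Bubble, StagePlan, schedule_checkpoints) and the timing numbers are hypothetical; this is a minimal greedy sketch of bubble-aware scheduling under stated assumptions, not BAFT's actual implementation.

# Hypothetical sketch: hide per-stage checkpoint snapshots inside pipeline bubbles.
from dataclasses import dataclass
from typing import List

@dataclass
class Bubble:
    start: float     # bubble start time within one training iteration (ms)
    duration: float  # idle time available on this stage (ms)

@dataclass
class StagePlan:
    stage_id: int
    snapshot_ms: float     # estimated time to copy this stage's model/optimizer state
    bubbles: List[Bubble]  # profiled bubbles of this stage in one iteration

def schedule_checkpoints(plans: List[StagePlan], max_overhead_ms: float = 1.0):
    """Greedily hide each stage's snapshot inside that stage's largest bubble;
    whatever does not fit is counted as exposed overhead on the iteration."""
    schedule, total_exposed = {}, 0.0
    for plan in plans:
        best = max(plan.bubbles, key=lambda b: b.duration, default=None)
        hidden = min(plan.snapshot_ms, best.duration) if best else 0.0
        exposed = plan.snapshot_ms - hidden
        total_exposed += exposed
        schedule[plan.stage_id] = {
            "bubble_start": best.start if best else None,
            "hidden_ms": hidden,
            "exposed_ms": exposed,
        }
    if total_exposed > max_overhead_ms:
        raise RuntimeError(
            f"exposed checkpoint time {total_exposed:.2f} ms exceeds the budget")
    return schedule

# Toy example: a 4-stage pipeline where earlier stages have smaller bubbles.
plans = [
    StagePlan(0, snapshot_ms=3.0, bubbles=[Bubble(start=0.0, duration=2.5)]),
    StagePlan(1, snapshot_ms=3.0, bubbles=[Bubble(start=5.0, duration=4.0)]),
    StagePlan(2, snapshot_ms=3.0, bubbles=[Bubble(start=10.0, duration=6.0)]),
    StagePlan(3, snapshot_ms=3.0, bubbles=[Bubble(start=15.0, duration=8.0)]),
]
print(schedule_checkpoints(plans, max_overhead_ms=1.0))

In a real system the profiler would measure bubble positions and snapshot costs per stage, and a scheduler could further split a snapshot across several bubbles rather than using only the largest one.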

Keywords

distributed training / fault tolerance / checkpoint / pipeline parallelism / error recovery

Cite this article

Runzhe CHEN, Guandong LU, Yakai WANG, Rui ZHANG, Zheng HU, Yanming MIAO, Zhifang CAI, Jingwen LENG, Minyi GUO. BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism. Front. Comput. Sci., 2025, 19(1): 191102 https://doi.org/10.1007/s11704-023-3401-5

Runzhe Chen obtained his bachelor’s degree in information engineering from Shanghai University, China, and is currently pursuing a master’s degree at Shanghai Jiao Tong University, China, under the guidance of Professor Jingwen Leng and Professor Minyi Guo. His research interests encompass distributed training, AI systems, and computer architecture

Guandong Lu received the BSc degree from Shanghai Jiao Tong University, China. He is currently working toward the MSc degree in computer science under the supervision of Dr. Jingwen Leng with the Department of Computer Engineering, Shanghai Jiao Tong University, China. His research interests include resilient computing for machine learning systems and computer architecture

Yakai Wang received his bachelor’s degree in Computer Science from Shanghai Jiao Tong University, China. He started his master’s degree in 2020 and is currently working with Professor Jingwen Leng at Shanghai Jiao Tong University, China. His current research interests include large-scale distributed learning and neural network reliability

Rui Zhang is a Senior Engineer at Huawei, specializing in computer architecture and systems. He currently focuses on improving the reliability and availability of large-scale computing systems. He received his PhD, Master’s, and Bachelor’s degrees in Computer Science and Engineering from The Ohio State University (USA), Peking University (China), and Shanghai Jiaotong University (China), respectively

Zheng Hu is the Director of the Reliability Technology Lab of Huawei Technologies Co., Ltd, and a CCF member. He is currently leading the trustworthy AI project, in charge of research and innovation on key technologies for reliable and safe AI systems. His research also covers software reliability, reliability theory, and ad-hoc networks. Zheng Hu received his PhD degree in Computer Science from Lyon University in France. Before joining Huawei, he was a senior researcher at Orange Labs (France Telecom), working on self-configuring networks for smart homes and smart buildings

Yanming Miao is a Principal Engineer at the Central Software Institute of Huawei Technologies Co., Ltd, China. He focuses on the reliability, serviceability, and usability of basic software, and is currently responsible for the reliability and usability of the AI framework. He received his bachelor’s degree in computer science and technology from Heilongjiang University, China

Zhifang Cai is a Principal Engineer at Huawei, responsible for the reliability of large-scale AI clusters. He received his Master’s degree in Testing Technology and Automation Device from Huazhong University of Science and Technology, China in 2012, and his Bachelor’s degree in Measurement and Control Technology and Instrument from Hunan University, China in 2009

Jingwen Leng is a Professor at Shanghai Jiao Tong University, China, and a member of the Emerging Parallel Computing Center. He graduated with a Bachelor’s degree from Shanghai Jiao Tong University, China in 2010, and obtained his PhD from the University of Texas, USA in 2016. His research focuses on AI systems, algorithm-software-hardware co-design, and the energy-efficiency and reliability optimization of heterogeneous computing systems

Minyi Guo is a Chair Professor at Shanghai Jiao Tong University, China. He is an IEEE Fellow and an ACM Distinguished Member. Minyi Guo received the BS and ME degrees in Computer Science from Nanjing University, China in 1982 and 1986, respectively. From 1986 to 1994, he was an assistant professor in the Department of Computer Science at Nanjing University, China. He received the PhD degree in information science from the University of Tsukuba, Japan in 1998. His research interests include parallel and distributed processing, parallelizing compilers, cloud computing, pervasive computing, software engineering, embedded systems, green computing, and wireless sensor networks

Acknowledgements

This work was supported by the National Key R&D Program of China (2021ZD0110104), the National Natural Science Foundation of China (Grant Nos. 62222210, U21B2017, 61832006, and 62072297). The authors would like to thank the anonymous reviewers for their constructive feedback for improving the work. Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

© 2025 Higher Education Press