BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism

Runzhe CHEN, Guandong LU, Yakai WANG, Rui ZHANG, Zheng HU, Yanming MIAO, Zhifang CAI, Jingwen LENG, Minyi GUO

Front. Comput. Sci., 2025, 19(1): 191102. DOI: 10.1007/s11704-023-3401-5

RESEARCH ARTICLE

Abstract

As deep neural networks (DNNs) have been successfully adopted in various domains, training these large-scale models has become increasingly difficult and is often deployed on compute clusters composed of many devices such as GPUs. However, as the size of the cluster increases, so does the possibility of failures during training. Currently, faults are mainly handled by recording checkpoints and recovering from them, but this approach introduces large overhead and affects training efficiency even when no error occurs: a low checkpointing frequency leads to a large loss of training time upon failure, while a high checkpointing frequency degrades training efficiency. To resolve this contradiction, we propose BAFT, a bubble-aware fault-tolerant framework for hybrid-parallel distributed training. BAFT automatically analyzes parallel strategies, profiles runtime information, and schedules checkpointing tasks at the granularity of pipeline stages depending on the bubble distribution during training. It supports high checkpoint efficiency and introduces less than 1% time overhead, which allows checkpoints to be recorded at high frequency, thereby reducing the time lost in error recovery and avoiding the impact of fault tolerance on training.

Keywords

distributed training / fault tolerance / checkpoint / pipeline parallelism / error recovery

Cite this article

Runzhe CHEN, Guandong LU, Yakai WANG, Rui ZHANG, Zheng HU, Yanming MIAO, Zhifang CAI, Jingwen LENG, Minyi GUO. BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism. Front. Comput. Sci., 2025, 19(1): 191102 https://doi.org/10.1007/s11704-023-3401-5

1 Introduction

Neural networks are widely used in various domains thanks to their superior performance. With the development of deep learning algorithms, increasingly capable neural networks have been proposed. Recent work in deep learning (DL) reflects this trend, where larger models tend to provide better performance. At the same time, the scale of deep learning models has increased exponentially in the past years. From AlexNet [1] with 60 million parameters to current large models such as Megatron-Turing NLG [2] with 530 billion parameters in total, the number of parameters in a model has grown by roughly ten thousand times.
The substantial volume of model parameters results in an exceptionally slow training process. To address this problem, solutions such as quantization [3], designing specialized architectures [4], pruning models and leveraging model sparsity [5], and developing high-performance operators [6] have been proposed. Despite the availability of these methods to accelerate training, the sheer scale of model parameters has rendered even the NVIDIA A100 [7], a state-of-the-art GPU, incapable of accommodating an entire model. Consequently, distributed training has become an inevitable choice for training large models. Distributed training reduces training time by splitting the dataset into many small shards and training on multiple nodes at the same time (supporting a larger batch size); it can also partition the model across nodes so that the number of parameters held by any single node is reduced.
However, distributed training significantly increases the training cost and time. In addition to resource management [8], fault tolerance should also be taken into consideration in distributed training. Due to long training times on large-scale clusters, interruptions to DNN training systems are common [9]; they may be caused by hardware [10,11] or software errors. These interruptions have a negative effect on training: when any node fails, the entire training task stops, leading to hours or even days of lost training time.
Unlike fault-tolerance strategies for model inference [12,13], those for model training primarily focus on recovering the model's state. When an interruption happens, all data in memory, such as the model weights, is destroyed. Users can only resume training by reloading the previous checkpoint, which requires the system to save checkpoints periodically. Although this method of recovery preserves most of the training progress, it also rolls the global training state back to the previous checkpoint, losing the training time between that checkpoint and the failure.
The total overhead $O_{total}$ of a checkpointing framework mainly includes the checkpoint saving overhead $O_{save}$ and the overhead owing to lost computation $O_{lost}$:
$O_{total} \approx O_{save} + O_{lost}.$
These two kinds of overhead can be calculated as follows:
$O_{save} = \dfrac{T_{total}}{T_{save}},$
$O_{lost} = \dfrac{T_{save}}{2} \times \dfrac{T_{total}}{T_{fail}}.$
Here $T_{total}$, $T_{save}$, and $T_{fail}$ represent the total training time, the time interval between two checkpoints, and the time interval between two errors, respectively.
It follows from the above formulas that reducing the checkpoint interval reduces the computation lost when an error occurs. However, since checkpointing itself requires communication between the computing device and storage, and may stall computation when the process is invoked, it also carries a large overhead [14,15]. Simply increasing the checkpoint frequency therefore lengthens the training process, which is not worth the gain in terms of the final training speed.
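For illustration, consider a hypothetical setting (the numbers are illustrative, not from our evaluation): with $T_{total} = 720$ h and $T_{fail} = 72$ h, about $T_{total}/T_{fail} = 10$ failures are expected. Choosing $T_{save} = 1$ h gives $O_{lost} = \frac{1}{2} \times 10 = 5$ h of lost work but requires $T_{total}/T_{save} = 720$ checkpoints, whereas $T_{save} = 0.1$ h cuts the expected loss to 0.5 h while raising the checkpoint count to 7200. The shorter interval therefore only pays off if each checkpoint is nearly free, which is exactly what hiding checkpointing inside bubbles aims to achieve.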
Given the aforementioned contradictions, we require a low-cost fault-tolerant method that is suitable for distributed training. This method should not only minimize the time overhead required to record checkpoints but also reduce the training time loss during error recovery.
Previous distributed cluster fault-tolerant solutions, such as VeloC [16], were often optimized for I/O between memory and external storage. However, in large-model training scenarios, such tools do not consider the impact of the checkpoint process itself on training, so they cannot meet the requirements of high-frequency backup during training and quick recovery after errors.
In distributed training, computing devices collaborate synchronously, which means that each device spends a significant amount of time waiting for other devices to complete their tasks. For instance, in pipeline parallelism scenarios, a computing device initiates local computation only after receiving results from a neighboring device. These wait times, commonly referred to as “bubbles,” are prevalent in distributed training.
However, current fault-tolerant schemes result in significant overhead and consume system resources during checkpointing. Our analysis of this issue suggests that the bubble time during training can be utilized for checkpoint recording to hide overhead. In response to this, we propose a plug-in fault-tolerant framework called BAFT in this paper. In summary, BAFT achieves the following contributions:
● BAFT provides a full-recovery checkpoint mechanism for distributed training, thus avoiding the accuracy loss caused by weight inconsistency that often occurs in partial recovery.
● BAFT saves the checkpoint of each node on its neighbor node to cope with failures during training. Based on this, we implement optimizations tailored to the characteristics of distributed training and achieve checkpointing with negligible overhead.
● Through our evaluation, we show that BAFT causes less checkpointing overhead than existing fault-tolerance mechanisms for distributed training. Furthermore, it is a general solution that can be used as long as pipeline parallelism exists in the distributed training strategy.

2 Background

This section briefly introduces three basic parallel strategies, their characteristics for distributed training of large models, and the necessity of a fault-tolerant framework for DNN model training.

2.1 Modes of parallelism

Due to the need to train large models like GPT-3 [2] with larger batch sizes and higher training throughput, the training task is often deployed on a cluster and executed in parallel by numerous devices. At present, the three main parallel modes are data parallelism, tensor parallelism, and pipeline parallelism.

2.1.1 Data parallelism

Data parallelism partitions the training dataset and trains on parallel devices. During the same training interval, different devices use their own data shards to train the model, and at the end of the computation of one batch of samples, gradient aggregation and state synchronization are performed between devices. Within a training interval, each device can use the data of its own shard for model training in parallel, thus greatly accelerating the overall training. Data parallelism provides a near-linear speedup, significantly improving training efficiency [17]. However, since data parallelism needs to synchronize model parameters, for large models the huge number of parameters incurs heavy communication overhead and becomes a bottleneck in training.
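To make the synchronization step concrete, the following minimal sketch (assuming torch.distributed is already initialized and a data parallel process group dp_group has been created) averages gradients across a data parallel group after the local backward pass; production frameworks such as PyTorch DDP overlap this communication with the backward computation:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module, dp_group) -> None:
    # Average gradients over the data parallel group after loss.backward().
    world = dist.get_world_size(group=dp_group)
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
            param.grad.div_(world)  # SUM followed by division = average
```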

2.1.2 Tensor parallelism

Instead of partitioning the dataset, tensor parallelism partitions the model weights across devices. Each device is responsible for computing the input data with its local parameters. After the computation completes, all devices responsible for the same layer communicate with each other and gather the output, and then the computation of the next layer is carried out. In tensor parallelism, data communication depends on how the parameters of operators are divided. Normally, after computing a split operator, the output data is gathered between devices. Therefore, tensor parallelism requires a large amount of data communication during both forward and backward computation.
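As a sketch of this idea (not Megatron-LM's actual implementation), the column-sharded linear layer below keeps only a slice of the weight on each rank, computes a partial output, and all-gathers the shards afterwards; the backward-pass collectives are omitted for brevity:

```python
import torch
import torch.distributed as dist

class ColumnShardedLinear(torch.nn.Module):
    """Each rank owns out_features // tp_size output columns of the weight."""

    def __init__(self, in_features: int, out_features: int, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(group=tp_group)
        self.weight = torch.nn.Parameter(
            torch.empty(out_features // tp_size, in_features))
        torch.nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = torch.nn.functional.linear(x, self.weight)  # local columns only
        shards = [torch.empty_like(partial)
                  for _ in range(dist.get_world_size(group=self.tp_group))]
        dist.all_gather(shards, partial, group=self.tp_group)  # gather outputs
        return torch.cat(shards, dim=-1)
```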

2.1.3 Pipeline parallelism

Pipeline parallelism splits the model according to its computation order. Each node is responsible for the computation of specific consecutive layers. During forward propagation, each node obtains the activation values from the previous node and uses them to compute its local layers. After the local forward computation, it sends the activations to the next node. Each node obtains gradients from the next node when backward propagation starts, and after the backward computation, the gradients are sent to the previous node. In pipeline parallelism, each node stalls its computation until it receives the activations/gradients from its previous/next node, which causes time bubbles during training.
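The following sketch (ranks, activation shapes, and device placement are assumed to be set up elsewhere) illustrates the forward step of an intermediate pipeline stage; the blocking recv/send calls are where the bubbles described above arise:

```python
import torch
import torch.distributed as dist

def forward_stage(stage_module, micro_batch, prev_rank, next_rank, act_shape):
    if prev_rank is not None:
        acts = torch.empty(act_shape, device="cuda")
        dist.recv(acts, src=prev_rank)      # stall until the previous stage sends
    else:
        acts = micro_batch                  # the first stage reads the input data

    out = stage_module(acts)                # compute the locally held layers

    if next_rank is not None:
        dist.send(out, dst=next_rank)       # hand the activations to the next stage
    return out
```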

2.2 Modern distributed training

At present, the training of large models, such as Pangu-α [18] and Megatron-LM [19], is performed in a hybrid parallel manner that combines the three aforementioned parallel modes.
Fig.1 shows an example of hybrid parallelism in Megatron-LM [19]. In this example, we have 16 GPUs denoted by $g_i$, $i \in [0,15]$. We use every 2 GPUs to construct a data parallel group, 2 GPUs to construct a tensor parallel group, and 4 GPUs to construct a pipeline parallel group. Thus we create 8 data parallel groups, 8 tensor parallel groups, and 4 pipeline parallel groups as follows:
Fig.1 Hybrid parallelism


8 data parallel groups:
$[g_0, g_2], [g_1, g_3], [g_4, g_6], [g_5, g_7],$
$[g_8, g_{10}], [g_9, g_{11}], [g_{12}, g_{14}], [g_{13}, g_{15}]$
8 tensor parallel groups:
$[g_0, g_1], [g_2, g_3], [g_4, g_5], [g_6, g_7],$
$[g_8, g_9], [g_{10}, g_{11}], [g_{12}, g_{13}], [g_{14}, g_{15}]$
4 pipeline parallel groups:
$[g_0, g_4, g_8, g_{12}], [g_1, g_5, g_9, g_{13}],$
$[g_2, g_6, g_{10}, g_{14}], [g_3, g_7, g_{11}, g_{15}]$
During the training process, as Fig.1 presents, each GPU communicates within its own parallel groups. For instance, $g_4$ performs all-reduce with $g_6$ to aggregate gradients (data parallelism), gathers tensors when a split operator completes (tensor parallelism), and sends/receives activations and gradients with its neighbor GPUs $g_0$ and $g_8$ (pipeline parallelism).
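The rank lists above can be derived mechanically from the three parallel degrees. The sketch below (plain rank arithmetic; in practice each list would be passed to torch.distributed.new_group on every rank, following the Megatron-LM convention) reproduces the grouping of Fig.1 for degrees 2/2/4:

```python
def build_parallel_groups(world_size: int = 16, data: int = 2,
                          tensor: int = 2, pipe: int = 4):
    assert data * tensor * pipe == world_size
    # tensor parallel groups: consecutive ranks of size `tensor`
    tensor_groups = [list(range(s, s + tensor))
                     for s in range(0, world_size, tensor)]
    # pipeline parallel groups: ranks separated by world_size // pipe
    stride = world_size // pipe
    pipe_groups = [list(range(o, world_size, stride)) for o in range(stride)]
    # data parallel groups: ranks offset by `tensor` inside each data*tensor block
    data_groups = [[block + off + k * tensor for k in range(data)]
                   for block in range(0, world_size, data * tensor)
                   for off in range(tensor)]
    return data_groups, tensor_groups, pipe_groups

# build_parallel_groups() yields [0, 2], [1, 3], ... / [0, 1], [2, 3], ... /
# [0, 4, 8, 12], ..., matching the groups listed above.
```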

3 Bubble aware fault-tolerant strategy

Bubbles denote the time a device is idle during training. As mentioned earlier, in hybrid parallelism, each node needs to wait for the activations or gradients sent from its predecessor or successor node before proceeding with computation after completing its local model computation task. While waiting for the data to arrive, the device is stalled, which leads to the generation of bubbles. These bubbles waste system resources, but we can use them for fault tolerance to mask the overhead brought by checkpointing, as Fig.2 presents.
Fig.2 Comparison of the two approaches to checkpointing. (a) Normal checkpointing timeline; (b) BAFT-style checkpointing timeline


3.1 Overhead of checkpoint saving

Prior work CheckFreq [14] proposes a two-step checkpoint-saving approach, enabling checkpoint saving to be performed asynchronously with training. Current research on checkpointing for deep neural network (DNN) training, such as Check-N-Run [15], also uses this approach to decouple the checkpointing process from training.
The basic CheckFreq-style checkpointing mechanism includes two phases, Snapshot and Persist. Snapshot creates a copy of the model in memory, and Persist asynchronously writes the snapshot to persistent storage, such as a hard disk or a remote file server. This two-step design takes full advantage of device performance in a single-node training scenario but does not exploit the features of distributed training. In BAFT's design, we implement a similar Snapshot to decouple checkpoint processing from training. As for Persist, we replace it with a Transfer phase that saves the checkpoint in a neighbor node's memory: the snapshot of each node is transferred to its neighbor node, which acts as a backup. The details and main overhead of these two phases are introduced as follows:

3.1.1 Snapshot

The target of Snapshot is to decouple checkpointing from computation. In the distributed training environment, each node creates a snapshot of its local part of the model. Snapshot must be performed atomically to ensure consistency; therefore, it is called between two weight-update steps. When spare GPU memory is sufficient to hold the snapshot, the model parameters are copied into GPU memory (GPU snapshot); otherwise, they are copied directly to CPU memory (CPU snapshot).
A GPU snapshot has higher performance because it avoids copying data between CPU and GPU memory, whereas a CPU snapshot consumes PCI-E bandwidth, which may already be occupied by sample loading at the start of an iteration.
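A minimal sketch of this Snapshot step is shown below (the free-memory check and the reserve margin are illustrative assumptions, not BAFT's exact policy; a truly asynchronous CPU copy would additionally require pinned host buffers):

```python
import torch

def take_snapshot(model: torch.nn.Module, reserve_bytes: int = 0) -> dict:
    free_bytes, _ = torch.cuda.mem_get_info()
    needed = sum(t.numel() * t.element_size() for t in model.state_dict().values())
    use_gpu = needed + reserve_bytes < free_bytes   # enough spare GPU memory?

    snapshot = {}
    for name, tensor in model.state_dict().items():
        if use_gpu:
            snapshot[name] = tensor.detach().clone()                       # GPU snapshot
        else:
            snapshot[name] = tensor.detach().to("cpu", non_blocking=True)  # CPU snapshot
    if not use_gpu:
        torch.cuda.synchronize()   # make sure the host copies have completed
    return snapshot
```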

3.1.2 Transfer

Transfer in BAFT sends the snapshot of the model weights from local memory to the neighbor node's memory. This phase is invoked after the Snapshot finishes and occupies communication bandwidth, so it must avoid the existing communication of the training process. Transfer is a node-to-node function: it includes a Send on the source device and a Receive on the destination device. Therefore, the receiver and the sender should schedule the communication and start the Transfer simultaneously to minimize the time spent waiting. If the Transfer and the other communication tasks are not scheduled well, the original communication, such as the activation/gradient send/receive in pipeline parallelism and the tensor reduction in tensor parallelism, will be stalled while the Transfer is being processed.
In summary, the two steps of checkpointing cost CPU-GPU memory bandwidth and inter-device bandwidth, respectively, so they do not directly affect the computation of training. Furthermore, if we properly schedule both the existing communication and the communication used for checkpointing, we can avoid its impact on the overall training and thus realize low-cost checkpointing.
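The Transfer step itself reduces to matched point-to-point operations between a node and its neighbor, as in the sketch below (a simplified view assuming both sides have already agreed, via the checkpoint plan, on which tensors to move in which bubble):

```python
import torch.distributed as dist

def transfer_snapshot_chunk(tensors, neighbor_rank: int, is_sender: bool):
    # Post non-blocking sends/receives so the call returns immediately ...
    works = []
    for t in tensors:
        if is_sender:
            works.append(dist.isend(t, dst=neighbor_rank))
        else:
            works.append(dist.irecv(t, src=neighbor_rank))
    # ... and complete them inside the bubble chosen by the checkpoint plan.
    for w in works:
        w.wait()
```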

3.2 Bubbles in distributed training

In hybrid parallelism, bubbles are mainly caused by pipeline parallelism's communication between neighbor nodes. The communication required by pipeline parallelism is only the transmission of gradients and activation values, which involves little communication volume. Therefore, most of the time each node spends after calling the communication API is spent waiting. To reduce time bubbles during training, pipeline parallelism divides the mini-batch into micro-batches, which enables different devices to process different micro-batches simultaneously.
We measured the proportion of bubbles in hybrid parallel training under different batch sizes and parallel strategies. For hardware, we used a 4-server training cluster with 4 A40 GPUs on each server. In the evaluation, we use Megatron-LM, a well-optimized distributed training framework, to run the training task. Megatron-LM has its own strategy to partition the model according to the parallel strategy and leaves fewer bubbles than unoptimized frameworks. In addition, compared with frameworks for automated distributed training, such as Alpa [20] and GSPMD [21], Megatron-LM has higher throughput, which means fewer bubbles, ensuring that BAFT is tested in a more rigorous environment. Therefore, if we can use the bubbles left by Megatron-LM to mask the overhead of checkpointing, the same mechanism can be effectively plugged into other distributed training frameworks, which cause more time bubbles.
To visualize the distribution of bubbles in distributed training, we tested different models with different parallel strategies on our cluster. Fig.3 shows each device's bubble percentage in our evaluation. As for the configuration, 4n4g denotes that our cluster consists of 4 servers with 4 GPUs each. The following string, such as d2m2p4, denotes the scale of each kind of parallelism. For instance, d2m2p4 means there are 2 GPUs in a data parallel group, 2 GPUs in a tensor parallel group, and 4 GPUs in a pipeline parallel group, respectively.
Fig.3 Bubble percentages of different configurations. (a) g128m8; (b) g64m8


It can be seen that, due to differences in the model parts held by each device and their computing responsibilities, the proportion of bubble time varies across devices. In our experiment, bubble time accounts for at least 20% of the training time.
For larger-scale distributed training, many previous efforts have attempted to reschedule the execution order of pipeline stages in order to reduce bubbles. By re-scheduling the execution order of tasks on each device, the bubble ratio can be significantly reduced compared to the original pipeline schedule, but it still cannot be completely eliminated. Chimera [22] conducted experiments training GPT-2 on 2048 NVIDIA A100 GPUs, and the resulting bubble ratios are shown in Fig.4.
Fig.4 Bubble ratio of different kinds of pipeline parallelisms in a large scale cluster


In summary, the Snapshot+Transfer method requires network bandwidth between nodes to record checkpoints on neighboring nodes. Training with pipeline parallelism introduces time bubbles, during which the network bandwidth is idle except for transmitting activations or gradients. Therefore, we can make use of these time bubbles to perform neighbor-node checkpointing.

4 BAFT overview

BAFT is a distributed checkpointing framework for large-scale training tasks. On the premise of ensuring model accuracy, and according to the characteristics of hybrid parallelism, BAFT designs a new mechanism so that checkpoints can be recorded with negligible overhead. BAFT implements chain replication, which means that each node saves its checkpoint on its neighbor node. For different types of errors, BAFT uses corresponding error-recovery schemes:
● Failure of a single GPU: Due to the existence of data parallelism in distributed training, the GPUs in the same data parallel group hold the same model weights. Therefore, for a single GPU, the other GPUs in its data parallel group provide a natural checkpoint of the model weights. For such failures, BAFT does not invoke additional fault-tolerance mechanisms during training and introduces no overhead.
● Failure of a server: Considering the amount of data communication, GPUs on the same server are usually assigned to the same data parallel group, so that large volumes of data do not have to cross servers and affect training efficiency. Therefore, when a failure occurs on a server, all the GPUs on it become inaccessible, and the data redundancy brought by data parallelism is lost. To cover this kind of failure, BAFT saves checkpoints on the neighbor server during training to ensure the accessibility of checkpoints.
As the analysis in Section 3.2 shows, each device has a certain percentage of bubbles at training time, which allows us to exploit them to mask the overhead brought by the fault-tolerance mechanism.
In this section, we present an overview of BAFT, which consists of three phases: profiling runtime information, generating the checkpoint plan, and executing the checkpoint function, as Fig.5 presents.
Fig.5 System overview of BAFT


4.1 Profiling runtime information

Before starting the checkpointing, BAFT will collect the runtime information of each node in the first few iterations of training.
In pipeline parallelism, training can be divided into stages, such as forward-compute, backward-compute, forward-send, forward-receive, backward-send, and backward-receive. In each iteration, the order and duration of each stage are fixed for each node. By identifying these stages, BAFT is able to find the idle time of each device during execution and mark it as a bubble. In order to avoid the influence of checkpointing on the original communication, BAFT also records the communication volume between the local node and its neighbor node around each bubble. The information profiled in this phase is used later to generate the checkpoint plan.
Furthermore, during the profiling phase, the memory usage of each device at every step of the training process is recorded. When generating the checkpoint policy, BAFT chooses between GPU snapshots and CPU snapshots based on this memory usage.
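The profiling step can be approximated with per-stage wall-clock timers, as in the sketch below (BAFT actually extends Megatron-LM's timers, see Section 5.1; GPU-accurate timing would additionally require CUDA events or explicit synchronization):

```python
import time

class BubbleProfiler:
    def __init__(self):
        self.stages = []                     # (name, start, end) per stage

    def run(self, name, fn, *args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)         # e.g., forward-compute, forward-send
        self.stages.append((name, start, time.perf_counter()))
        return result

    def bubbles(self, min_gap: float = 1e-3):
        # Idle gaps between consecutive stages are reported as bubbles.
        gaps = []
        for (_, _, end), (_, next_start, _) in zip(self.stages, self.stages[1:]):
            if next_start - end > min_gap:
                gaps.append((end, next_start))
        return gaps
```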

4.2 Generating checkpoint plan

This phase generates the checkpoint plan. When the training process reaches the preset steps in the checkpoint plan, BAFT determines whether to perform checkpointing at this time or to continue training.
The checkpoint plan includes:

4.2.1 Destination to accommodate the checkpoint

For each node, the checkpointing scheme needs to determine on which node its local data should be saved. We choose the previous node of the current node in the pipeline group. This choice is based on the fact that devices holding adjacent pipeline stages have more closely aligned bubbles, so there is a larger time intersection of bubbles available for checkpointing.

4.2.2 Time to checkpoint

BAFT determines the time to checkpoint based on the runtime information collected in the profiling phase. The generation of the Snapshot starts immediately after the weight update, and the time to start the Transfer depends on the checkpoint plan generated by Algorithm 1.
For each pipeline stage, Algorithm 1 iterates over the bubbles of $node_i$ and its target $node_{i-1}$ determined in Section 4.2.1. As Fig.6 shows, in order to avoid influencing training, BAFT calculates the time required for regular communication according to the communication volume and network bandwidth and reserves the corresponding time at the start and the end of each bubble. Through Algorithm 1, the ckptSendPlan and the ckptRecvPlan are generated, which contain the bubbles in which BAFT needs to call the send and receive functions, respectively.
Fig.6 Reserve the time for regular communication and pick the bubbles for Transfer


We use a variable m to denote the number of parameters that can be transferred by the bubbles selected so far. Every time a bubble is appended to the plan, the number of parameters calculated from the bubble's duration and the bandwidth is added to m, until m reaches the number of parameters that need to be transferred.
It is worth noting that if the bubbles within one iteration are not enough to transmit all the local parameters to the neighbor node, the function GetNextBubble returns the bubbles of the next iteration. Accordingly, the granularity of checkpoint scheduling becomes n iterations, where n is the number of iterations whose bubbles suffice for parameter checkpointing. In other words, the checkpoint is recorded every n iterations. This allows us to mask the overhead of checkpointing completely. Since we transfer data from previously taken snapshots, checkpointing across multiple iterations does not cause inconsistency in the saved parameters.

[Algorithm 1: generation of the ckptSendPlan and ckptRecvPlan]
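A condensed sketch of this bubble-selection logic is given below (function and variable names are ours, not the paper's listing; a complete implementation would keep pulling bubbles from subsequent iterations via GetNextBubble instead of stopping at the end of the lists):

```python
def plan_transfers(send_bubbles, recv_bubbles, params_needed,
                   bandwidth, reserved_time):
    # send_bubbles / recv_bubbles: (start, end) intervals of node_i and node_{i-1};
    # reserved_time(bubble): time held back for regular pipeline communication.
    ckpt_send_plan, ckpt_recv_plan = [], []
    m = 0                                             # parameters covered so far
    for s, r in zip(send_bubbles, recv_bubbles):      # GetNextBubble on both sides
        if m >= params_needed:
            break
        # usable window = overlap of the two bubbles minus the reserved edges
        start = max(s[0], r[0]) + reserved_time(s)
        end = min(s[1], r[1]) - reserved_time(r)
        if end <= start:
            continue
        ckpt_send_plan.append((start, end))
        ckpt_recv_plan.append((start, end))
        m += int((end - start) * bandwidth)           # params movable in this window
    return ckpt_send_plan, ckpt_recv_plan
```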

4.2.3 Parameters to be transferred

In the pipeline parallel group and model parallel group, each node holds a different part of the model weights. However, within the same data parallel group, all nodes hold the same weights. In order to minimize the amount of data that needs to be transferred, the data held by a data parallel group with n nodes is equally divided into n parts, and each node is responsible for checkpointing one of them.
Based on the previous analysis of bubbles, we already know the duration of each bubble that can be used to save data on the neighbor node. To improve bubble utilization, we need to transfer as much data as possible in each bubble. To maximize the data transferred during a bubble, BAFT uses the layer as the granularity and applies dynamic programming, as in Algorithm 2, to select the weights to be transmitted in each bubble. The number of parameters to be transferred, M, in the input is determined by the bubble's duration and the bandwidth between nodes.

[Algorithm 2: dynamic-programming selection of layers to transfer in each bubble]
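The per-bubble selection is essentially a 0/1 knapsack over layers, which the sketch below illustrates (layer sizes and the budget M are counted in parameters; in practice the budget would be bucketed to keep the table small; names and structure are illustrative, not the paper's listing):

```python
def pick_layers(layer_sizes, budget_m: int):
    # best[j]   = max number of parameters that fit into a budget of j
    # chosen[j] = indices of the layers achieving best[j]
    best = [0] * (budget_m + 1)
    chosen = [[] for _ in range(budget_m + 1)]
    for idx, size in enumerate(layer_sizes):
        for j in range(budget_m, size - 1, -1):       # iterate the budget downwards
            if best[j - size] + size > best[j]:
                best[j] = best[j - size] + size
                chosen[j] = chosen[j - size] + [idx]
    return chosen[budget_m]                           # layers to send in this bubble
```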

4.3 Executing checkpoint function

When training progresses to a bubble marked in the checkpoint plan, the corresponding time is used by Transfer to back up the local parameters to the neighbor node. When a node has received all the data from its neighbor, the checkpoint is marked as complete and will not be overwritten by other data until the next round of parameter checkpointing succeeds. This ensures that we have consistent data to recover from if a device crashes after transmitting only part of the parameters.

4.4 Error recovery

In order to address the issue of error recovery in distributed training, we developed a dedicated module to recover model parameters in the event of node failure. Within the BAFT framework, each node stores the model parameters of its respective predecessor node in the pipeline parallel group. Should a communication error occur, the saved model parameters can be retrieved. When a communication timeout from the predecessor node is detected, the local node examines the data parallel group to which the GPUs on the predecessor server belong. It then selects one GPU within that data parallel group as the recipient of the backup model parameters. Once the corresponding model parameters have been sent to the selected GPU on the predecessor server, they are broadcast within the same data parallel group, effectively restoring the model parameters of all GPUs on the server. This transmit-then-broadcast method not only eliminates the cross-server transmission of duplicate data but also capitalizes on the high bandwidth within the node for broadcasting, thus enhancing the overall efficiency of data recovery.
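A simplified sketch of this recovery path is shown below (ranks and process groups are assumed to be re-established by the job launcher after the failed server rejoins; backup_state holds the saved tensors on the backup holder and correctly shaped empty buffers on the receiving ranks; the names are illustrative, not BAFT's API):

```python
import torch.distributed as dist

def restore_from_backup(backup_state, backup_holder, target, dp_group, dp_ranks):
    me = dist.get_rank()
    for name, tensor in backup_state.items():
        if me == backup_holder:
            dist.send(tensor, dst=target)                 # one cross-server copy
        elif me == target:
            dist.recv(tensor, src=backup_holder)          # receive the backup
            dist.broadcast(tensor, src=target, group=dp_group)   # fan out locally
        elif me in dp_ranks:
            dist.broadcast(tensor, src=target, group=dp_group)   # fast intra-server path
```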

5 Evaluation

5.1 BAFT implementation

We implement BAFT with PyTorch v1.12.0. Communication between nodes is achieved by calling the APIs provided by torch.distributed. We adapted BAFT to Megatron-LM and took advantage of its scheduler to plug checkpointing phases into regular training. Furthermore, we modified Megatron-LM's timer to implement the runtime profiling function described in Section 4.1.

5.2 Experimental set-up

To assess the impact of BAFT on training, we conducted experiments on our GPU cluster, which comprises four servers, each equipped with four NVIDIA A40 GPUs; all 16 GPUs were used to perform the training tasks. In our experiments, the 4 GPUs within the same server are interconnected with PCI-E 4.0 x16 links, providing a bandwidth of 31.5 GB/s between GPUs, and communication among the 4 servers uses a 10 Gbps network. The models used in our evaluation include GPT-2 [23], Bert-Large, and Bert-Exlarge [24], as presented in Tab.1, with multiple configurations to evaluate the performance of BAFT. The first number in the batch size indicates the global batch size, while the second denotes the mini-batch size. The WikiText [25] dataset was used as input for training the models.
Tab.1 Training configurations
Models: Bert-Large, Bert-Exlarge, GPT-2
Batch sizes (global, mini): (128, 8), (64, 8)
Parallel strategies: d2m2p4, d4m1p4, d1m4p4, d2m1p8, d1m2p8
Tab.2 Comparison of related work
Name | Supports different models | Supports distributed training | Time overhead | No effect on convergence
CheckFreq [14] | ✓ | × | Low | ✓
Deepfreeze [27] | ✓ | ✓ | Low | ×
FTPipeHD [26] | ✓ | ✓ | High | ✓
ResPipe [28] | ✓ | ✓ | High | ✓
Check-N-Run [15] | × | ✓ | Low | ×
CPR [29] | ✓ | ✓ | Low | ×
BAFT | ✓ | ✓ | Low | ✓

5.3 Training performance

In order to assess the practical performance of BAFT during training, we trained various models using the parallel configurations listed in Tab.1 and observed their performance under three conditions: no fault tolerance, fault tolerance using BAFT, and fault tolerance using FTPipeHD [26], a fault-tolerance framework designed for pipeline parallelism. Each test ran 1000 iterations and reports the average iteration time. No fault tolerance means that no backup was performed during training. We implemented the fault-tolerant mechanism used by FTPipeHD based on pipeline-parallelism characteristics and compared it with BAFT. The bar chart in Fig.7 shows the time required for one iteration under different configurations, while the line chart depicts the percentage of additional overhead incurred by BAFT compared to the no-fault-tolerance condition. The results indicate that with BAFT, the time overhead of each iteration was no more than 2% compared to the no-fault-tolerance condition, while the backup method used by FTPipeHD, which does not utilize bubbles, led to significant time overhead.
Fig.7 Comparison of checkpointing overhead between BAFT and FTPipeHD. (a) Batch size (128,8); (b) batch size (64,8)


In the experiments of this study, we attempted various parallel configurations on an equally sized training cluster. This effectively altered the communication method between adjacent nodes in the pipeline group. In some configurations, neighboring nodes communicated via the local PCI-E, while in others, communication occurred through the network. The experimental results indicate that changing the communication method had minimal impact on the performance of BAFT.
In fact, different communication methods do not affect the number of parameters that need to be checkpointed. Therefore, the proportion of the parameters a node needs to checkpoint within its total communication volume remains constant. This ensures that regardless of the communication method used to reach neighboring nodes, BAFT keeps the introduced overhead within an acceptable proportion of the total training time.

5.4 Time loss of error recovery

As mentioned in Section 1, the time loss mainly depends on the checkpoint interval $T_{save}$. In our evaluation, under the premise that the overhead of checkpointing can be completely hidden, we chose the highest possible checkpointing frequency. In most configurations, $T_{save}$ is the time of one or two iterations, and in the extreme case, three iterations. As a result, a failure loses only a few iterations, i.e., only several seconds of training are wasted.
Across all configurations tested in our experiments, recording a checkpoint on the neighboring node completed within 1 to 3 iterations. Consequently, in the event of an error, the training time lost is only the time spent on these 1 to 3 iterations, approximately 0.6 to 5.5 s.
In addition, because the checkpoints used in error recovery are kept on neighbor nodes, the failed node only spends several seconds loading the checkpoint. Thus, the time cost of error recovery is greatly reduced. Fig.8 shows the time needed to reload the checkpoint from the neighbor node in our experiments.
Fig.8 Time cost to reload checkpoint


6 Related work

Checkpoint-based error-recovery technology has been studied to mask the overhead of checkpointing and alleviate the influence caused by its execution. CheckFreq [14] dynamically tunes the checkpoint frequency depending on the model and the training environment. Furthermore, it divides the checkpointing procedure into two steps: copy the model state in memory, then write it to persistent storage. By pipelining these two steps, CheckFreq significantly shortens the computing stall caused by checkpointing. However, CheckFreq is not optimized for the characteristics of distributed training, so it cannot leverage training bubbles to hide checkpointing overhead. Deepfreeze [27] uses storage-specific lightweight APIs to minimize I/O operations. It also uses sharding, according to the features of data parallelism, to reduce the number of parameters that need to be checkpointed. Despite numerous optimizations targeting the checkpointing process, it still incurs additional overhead and impacts training efficiency because it does not exploit idle resources during training. FTPipeHD [26] and ResPipe [28] specialize checkpointing for pipeline parallelism by saving the model state on neighbor nodes. They focus on ensuring checkpoint availability, i.e., that the parameters necessary for recovery can be retrieved in case of failure. However, they lack performance optimizations: both schemes incur nearly 50% overhead in the iteration in which the checkpointing function is called. Check-N-Run [15] compresses checkpoints by means of quantization, which reduces the overhead of backups, and reduces the computation stall by decoupling checkpointing from training. Based on the way parameters are updated in recommendation-model training, Check-N-Run and CPR [29] use differential checkpointing and partial recovery, respectively, to reduce the system load.
In summary, as Tab.2 shows, the existing fault-tolerant schemes for distributed training are either not optimized for parallel strategies, resulting in significant overhead, or only support specific models.

7 Conclusion

This paper introduces BAFT, a fault-tolerant framework designed for distributed training that utilizes checkpointing. BAFT is a full-recovery framework that avoids the stale-weight problem caused by partial recovery. By leveraging the characteristics of hybrid parallel training, BAFT optimizes the checkpointing mechanism to minimize overhead and to avoid any impact of the fault-tolerant framework on training efficiency.
In practice, BAFT first profiles several iterations to determine the duration of each step within an iteration and generates a checkpointing plan that leverages bubbles to mask overhead. We integrated BAFT into Megatron-LM and evaluated its performance by training models with different parallel strategies and batch sizes. Our results indicate that BAFT can complete model checkpoint recording with less than 1% overhead. Although initially implemented in Megatron-LM, the mechanism of BAFT can be easily adapted to other distributed training frameworks. Given the widespread use of pipeline parallelism in training large models, enabling BAFT during training is a straightforward process.

Runzhe Chen obtained his bachelor’s degree in information engineering from Shanghai University, China and is currently pursuing a master’s degree at Shanghai Jiao Tong University, China. He is working under the guidance of Professor Jingwen Leng and Professor Minyi Guo. His research interests encompass distributed training, AI systems, and computer architecture

Guandong Lu received the BSc degree from Shanghai Jiao Tong University, China. He is currently toward the MSc degree in the field of computer science under supervision of Dr. Jingwen Leng with Department of Computer Engineering Faculty, Shanghai Jiao Tong University, China. His research interests include resilient computing on machine learning system and computer architecture

Yakai Wang received his bachelor’s degree in Computer Science from Shanghai Jiao Tong University, China. He started his master’s degree in 2020 and is currently working with Professor Jingwen Leng at Shanghai Jiao Tong University, China. His current research interests include large-scale distributed learning and neural network reliability

Rui Zhang is a Senior Engineer at Huawei, specialized in computer architecture and system. He currently focuses on improving reliability and availability for large-scale computing systems. He received his PhD, Master, and Bachelor degrees in Computer Science and Engineering from The Ohio State University (USA), Peking University (China), and Shanghai Jiaotong University (China), respectively

Zheng Hu is the Director of the Reliability Technology Lab of Huawei Technologies Co., Ltd. and a CCF member. He is currently leading the trustworthy AI project, in charge of the research and innovation of key technologies towards reliable and safe AI systems. Meanwhile, his research also focuses on software reliability, reliability theory, and ad-hoc networks. Zheng Hu received his PhD degree in Computer Science from Lyon University in France. Before joining Huawei, he was a senior researcher at Orange Labs (France Telecom), working on self-configuring networks for smart homes/smart buildings

Yanming Miao is the Principal engineer of the Central Software Institute of Huawei Technologies Co., Ltd, China. He focuses on the reliability, serviceability, and usability of basic software. Currently responsible for the reliability and usability of the AI framework. He received his bachelor’s degree in computer science and technology from Heilongjiang University, China

Zhifang Cai is a Principal Engineer at Huawei, responsible for the reliability of large-scale AI clusters. He received his Master's degree in Testing Technology and Automation Device from Huazhong University of Science and Technology, China in 2012, and his Bachelor's degree in Measurement and Control Technology and Instrument from Hunan University, China in 2009

Jingwen Leng is Professor at Shanghai Jiao Tong University, China, and a member of the Emerging Parallel Computing Center. He graduated with a Bachelor’s degree from Shanghai Jiao Tong University, China in 2010, and obtained his PhD from the University of Texas, USA in 2016. His research focuses on AI systems, algorithm-software-hardware co-design, energy-efficient optimization of heterogeneous computing systems, and reliability optimization of heterogeneous computing systems

Minyi Guo is a Chair Professor at Shanghai Jiao Tong University, China. He is IEEE Fellow, and ACM Distinguished Member. Minyi Guo received the BS and ME degrees in Computer Science from Nanjing University, China in 1982 and 1986, respectively. From 1986 to 1994, he had been an assistant professor of the Department of Computer Science at Nanjing University, China. He received the PhD degree in information science from University of Tsukuba, Japan in 1998. His research interests include parallel and distributed processing, parallelizing compilers, cloud computing, pervasive computing, software engineering, embedded systems, green computing, and wireless sensor networks

References

[1]
Krizhevsky A, Sutskever I, Hinton G E . ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60( 6): 84–90
[2]
Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 159
[3]
Guo C, Zhang C, Leng J, Liu Z, Yang F, Liu Y, Guo M, Zhu Y. Ant: exploiting adaptive numerical data type for low-bit deep neural network quantization. In: Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 2022, 1414−1433
[4]
Wang Y, Zhang C, Xie Z, Guo C, Liu Y, Leng J. Dual-side sparse tensor core. In: Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA). 2021, 1083−1095
[5]
Guo C, Hsueh B Y, Leng J, Qiu Y, Guan Y, Wang Z, Jia X, Li X, Guo M, Zhu Y. Accelerating sparse DNN models without hardware-support via tile-wise sparsity. In: Proceedings of SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 2020, 1−15
[6]
Guo C, Qiu Y, Leng J, Zhang C, Cao Y, Zhang Q, Liu Y, Yang F, Guo M. Nesting forward automatic differentiation for memory-efficient deep neural network training. In: Proceeding of the 40th IEEE International Conference on Computer Design (ICCD). 2022, 738−745
[7]
Choquette J, Gandhi W. NVIDIA A100 GPU: performance & innovation for GPU computing. In: Proceedings of 2020 IEEE Hot Chips 32 Symposium (HCS). 2020, 1−43
[8]
Liu Z, Leng J, Zhang Z, Chen Q, Li C, and Guo M. VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling. In: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '22). 2022, 388–401
[9]
Jeon M, Venkataraman S, Phanishayee A, Qian J, Xiao W, Yang F. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In: Proceedings of 2019 USENIX Conference on USENIX Annual Technical Conference. 2019, 947−960
[10]
Gizopoulos D, Papadimitriou G, Chatzidimitriou A, Reddi V J, Salami B, Unsal O S, Kestelman A C, Leng J. Modern hardware margins: CPUs, GPUs, FPGAs recent system-level studies. In: Proceedings of the 25th IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS). 2019, 129−134
[11]
Papadimitriou G, Chatzidimitriou A, Gizopoulos D, Reddi V J, Leng J, Salami B, Unsal O S, Kestelman A C. Exceeding conservative limits: a consolidated analysis on modern hardware margins. IEEE Transactions on Device and Materials Reliability, 2020, 20(2): 341–350
[12]
Qiu Y, Leng J, Guo C, Chen Q, Li C, Guo M, Zhu Y. Adversarial defense through network profiling based path extraction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 4777–4786
[13]
Leng J, Buyuktosunoglu A, Bertran R, Bose P, Chen Q, Guo M, Janapa Reddi V. Asymmetric resilience: exploiting task-level idempotency for transient error recovery in accelerator-based systems. In: Proceedings of 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 2020, 44−57
[14]
Mohan J, Phanishayee A, Chidambaram V. CheckFreq: Frequent, fine-grained DNN checkpointing. In: Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21). 2021, 203−216
[15]
Eisenman A, Matam K K, Ingram S, Mudigere D, Krishnamoorthi R, Nair K, Smelyanskiy M, Annavaram M. Check-N-Run: a checkpointing system for training deep learning recommendation models. In: Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 2022, 929−943
[16]
Nicolae B, Li J, Wozniak J M, Bosilca G, Dorier M, Cappello F. DeepFreeze: towards scalable asynchronous checkpointing of deep learning models. In: Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). 2020, 172−181
[17]
Li S, Zhao Y, Varma R, Salpekar O, Noordhuis P, Li T, Paszke A, Smith J, Vaughan B, Damania P, Chintala S . PyTorch distributed: experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 2020, 13( 12): 3005–3018
[18]
Zeng W, Ren X, Su T, Wang H, Liao Y, Wang Z, Jiang X, Yang Z, Wang K, Zhang X, Li C, Gong Z, Yao Y, Huang X, Wang J, Yu J, Guo Q, Yu Y, Zhang Y, Wang J, Tao H, Yan D, Yi Z, Peng F, Jiang F, Zhang H, Deng L, Zhang Y, Lin Z, Zhang C, Zhang S, Guo M, Gu S, Fan G, Wang Y, Jin X, Liu Q, Tian Y. PanGu-α: large-scale autoregressive pretrained Chinese language models with auto-parallel computation. 2021, arXiv preprint arXiv: 2104.12369
[19]
Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Korthikanti V, Vainbrand D, Kashinkunti P, Bernauer J, Catanzaro B, Phanishayee A, Zaharia M. Efficient large-scale language model training on GPU clusters using megatron-LM. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021, 58
[20]
Zheng L, Li Z, Zhang H, Zhuang Y, Chen Z, Huang Y, Wang Y, Xu Y, Zhuo D, Xing E P, Gonzalez J, Stoica I. Alpa: automating inter- and intra-operator parallelism for distributed deep learning. In: Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation. 2022, 559−578
[21]
Xu Y, Lee H, Chen D, Hechtman B A, Huang Y, Joshi R, Krikun M, Lepikhin D, Ly A, Maggioni M, Pang R, Shazeer N, Wang S, Wang T, Wu Y, Chen Z. GSPMD: general and scalable parallelization for ML computation graphs. 2021, arXiv preprint arXiv: 2105.04663
[22]
Li S, Hoefler T. Chimera: efficiently training large-scale neural networks with bidirectional pipelines. In: Proceedings of SC21: International Conference for High Performance Computing, Networking, Storage and Analysis. 2021, 1−14
[23]
Radford A, Narasimhan K. Improving language understanding by generative pre-training. 2018
[24]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
[25]
Merity S, Xiong C, Bradbury J, Socher R. Pointer sentinel mixture models. In: Proceedings of the 5th International Conference on Learning Representations. 2017
[26]
Chen Y, Yang Q, He S, Shi Z, Chen J. FTPipeHD: a fault-tolerant pipeline-parallel distributed training framework for heterogeneous edge devices. 2021, arXiv preprint arXiv: 2110.02781
[27]
Nicolae B, Moody A, Gonsiorowski E, Mohror K, Cappello F. VeloC: towards high performance adaptive asynchronous checkpointing at large scale. In: Proceedings of 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2019, 911−920
[28]
Li P, Koyuncu E, Seferoglu H. Respipe: resilient model-distributed DNN training at edge networks. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021, 3660−3664
[29]
Maeng K, Bharuka S, Gao I, Jeffrey M C, Saraph V, Su B Y, Trippel C, Yang J, Rabbat M, Lucia B, Wu C J. CPR: understanding and improving failure tolerant training for deep learning recommendation with partial recovery. 2020, arXiv preprint arXiv: 2011.02999

Acknowledgements

This work was supported by the National Key R&D Program of China (2021ZD0110104), the National Natural Science Foundation of China (Grant Nos. 62222210, U21B2017, 61832006, and 62072297). The authors would like to thank the anonymous reviewers for their constructive feedback for improving the work. Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press