1 Introduction
Neural networks are widely used across many domains owing to their superior performance. With the development of deep learning algorithms, increasingly higher-performance neural networks have been proposed. Recent work in deep learning (DL) reflects this trend: larger models tend to provide better performance. At the same time, the scale of deep learning models has grown exponentially in recent years. From AlexNet [1] with 60 million parameters to current large models such as Megatron-Turing NLG [2] with 530 billion parameters in total, the number of parameters in a model has increased by roughly four orders of magnitude.
The substantial volume of model parameters makes training exceptionally slow. To address this problem, solutions such as quantization [3], designing specialized architectures [4], pruning models and exploiting model sparsity [5], and developing high-performance operators [6] have been proposed. Despite these methods for accelerating training, the sheer scale of model parameters means that even the NVIDIA A100 [7], one of the most advanced GPUs, cannot accommodate an entire model. Consequently, distributed training has become an inevitable choice for training large models. Distributed training reduces training time by splitting the dataset into many small shards and training on multiple nodes simultaneously (supporting larger batch sizes); it can also partition the model across nodes, so that the number of parameters held by each node is reduced.
However, distributed training significantly increases training cost and time. In addition to resource management [8], fault tolerance must also be taken into consideration. Because training runs for a long time on a large-scale cluster, interruptions to a DNN training system are common [9] and may be caused by hardware [10, 11] or software errors. These interruptions have a negative effect on training: when any node fails, the entire training task stops, leading to hours or even days of lost training time.
Unlike fault tolerance strategies for model inference [12, 13], those for model training primarily focus on recovering the model's state. When an interruption happens, all data in memory, such as the model weights, is destroyed. Users can only resume training by reloading the previous checkpoint, which requires the system to save checkpoints periodically. Although this form of recovery preserves most of the training progress, it also rolls the global training state back to the previous checkpoint, losing all training done between that checkpoint and the failure.
The overall overhead of a checkpointing framework consists mainly of two parts: the checkpoint saving overhead and the overhead due to lost computation.
These two kinds of overhead can be estimated as follows:

$$O_{\text{save}} = \frac{T_{\text{total}}}{T_{\text{ckpt}}} \cdot t_{\text{save}}, \qquad O_{\text{lost}} = \frac{T_{\text{total}}}{T_{\text{err}}} \cdot \frac{T_{\text{ckpt}}}{2}.$$

Here $T_{\text{total}}$, $T_{\text{ckpt}}$, and $T_{\text{err}}$ represent the total training time, the time interval between two checkpoints, and the time interval between two errors, respectively; $t_{\text{save}}$ denotes the time to record a single checkpoint, and each failure is assumed to lose, on average, half a checkpoint interval of computation.
The above formulas show that reducing the checkpoint interval reduces the computation lost when an error occurs. However, since checkpointing itself requires communication between the computing device and storage, and may stall computation when it is invoked, it also incurs a large overhead [14, 15]. Simply increasing the checkpoint frequency therefore lengthens the training process, and the cost outweighs the gain in overall training speed.
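To make this trade-off concrete, the short sketch below evaluates both overhead terms from the formulas above for a few checkpoint intervals. All numbers, and the per-checkpoint saving cost t_save, are illustrative assumptions rather than measurements from this paper.

```python
# Hypothetical illustration of the checkpoint-overhead trade-off described above.
# All values are assumptions for demonstration, not measurements.

def checkpoint_overheads(t_total, t_ckpt, t_err, t_save):
    """Return (saving overhead, lost-computation overhead) in seconds.

    t_total: total training time
    t_ckpt:  interval between two checkpoints
    t_err:   mean interval between two errors
    t_save:  cost of recording one checkpoint
    """
    o_save = (t_total / t_ckpt) * t_save          # number of checkpoints x cost per checkpoint
    o_lost = (t_total / t_err) * (t_ckpt / 2.0)   # failures x expected half-interval of lost work
    return o_save, o_lost

# Example: 100 h of training, one failure every 10 h, 30 s to record a checkpoint.
for interval_min in (5, 30, 120):
    o_save, o_lost = checkpoint_overheads(
        t_total=100 * 3600, t_ckpt=interval_min * 60, t_err=10 * 3600, t_save=30)
    print(f"interval={interval_min:3d} min  save={o_save / 3600:5.2f} h  lost={o_lost / 3600:5.2f} h")
```

Under these assumed numbers, a short interval minimizes lost work but spends hours saving checkpoints, while a long interval does the opposite, which motivates hiding the saving cost instead of merely tuning the interval.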
Given this trade-off, we need a low-cost fault-tolerant method suitable for distributed training. Such a method should not only minimize the time overhead of recording checkpoints but also reduce the training time lost during error recovery.
Previous fault-tolerant solutions for distributed clusters, such as VeloC [16], were often optimized for I/O between memory and external storage. However, in large-model training scenarios, such tools do not consider the impact of the checkpointing process itself on training, so they cannot meet the requirements of high-frequency backup during training and quick recovery after errors.
In distributed training, computing devices collaborate synchronously, which means that each device spends a significant amount of time waiting for other devices to complete their tasks. For instance, in pipeline parallelism scenarios, a computing device initiates local computation only after receiving results from a neighboring device. These wait times, commonly referred to as “bubbles,” are prevalent in distributed training.
However, current fault-tolerant schemes result in significant overhead and consume system resources during checkpointing. Our analysis suggests that the bubble time during training can be used for checkpoint recording to hide this overhead. Based on this observation, we propose a plug-in fault-tolerant framework called BAFT. In summary, BAFT makes the following contributions:
● BAFT provides a full-recovery checkpoint mechanism for distributed training, thus avoiding the accuracy loss caused by weight inconsistency, which often occurs in partial recovery.
● BAFT saves each node's checkpoint on its neighbor node to cope with failures during training. Based on this, we implement optimizations that exploit the characteristics of distributed training and achieve checkpointing with negligible overhead.
● Through our evaluation, we show that BAFT causes less checkpointing overhead than existing fault-tolerance mechanisms for distributed training. Furthermore, it is a general solution that can be used as long as pipeline parallelism exists in the distributed training strategy.
2 Background
This section briefly introduces three basic parallel strategies, their characteristics for distributed training of large models, and the necessity of a fault-tolerant framework for DNN model training.
2.1 Modes of parallelism
Due to the need to train large models like GPT-3 [2] with larger batch sizes and higher training throughput, the training task is often deployed on a cluster and executed in parallel by numerous devices. At present, the three main parallel modes are data parallelism, tensor parallelism, and pipeline parallelism.
2.1.1 Data parallelism
Data parallelism means that the training dataset is partitioned across the parallel training devices. During the same training interval, different devices train the model with their own data shards. At the end of each batch of samples, gradient aggregation and model state synchronization are performed between devices. Within a training interval, each device trains on its own shard in parallel, greatly accelerating overall training. Data parallelism provides a near-linear speedup, significantly improving training efficiency [17]. However, since data parallelism must synchronize the model parameters, the large number of parameters in large models brings substantial communication overhead and becomes a training bottleneck.
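As a minimal illustration (not the synchronization code of any particular framework), the sketch below shows the gradient-aggregation step of data parallelism with torch.distributed, assuming the process group has already been initialized and `model` is the local replica.

```python
# Minimal sketch of data-parallel gradient aggregation after a local backward pass.
# Assumes torch.distributed is already initialized (e.g., via torchrun).
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data-parallel workers so every replica
    applies the same update in the following optimizer step."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```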
2.1.2 Tensor parallelism
Instead of partitioning the dataset, tensor parallelism partitions the model weights across devices. Each device is responsible for computing on the input data with its local parameters. After the computation is completed, all devices responsible for the same layer communicate with each other to gather the output, and then the computation of the next layer proceeds. In tensor parallelism, data communication depends on how the parameters of each operator are divided. Normally, after computing a split operator, the output data is gathered across devices. Therefore, tensor parallelism requires a large amount of data communication during forward and backward computation.
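The sketch below illustrates this with a column-parallel linear layer: each device holds a slice of the weight matrix, computes a partial output, and the partial outputs are gathered across the tensor-parallel group. The class name and group handling are assumptions for illustration, not Megatron-LM's actual implementation, and the gather is shown for the forward pass only.

```python
# Minimal sketch of tensor parallelism: a linear layer whose output columns are
# split across the devices of one tensor-parallel group.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(group=tp_group)
        assert out_features % tp_size == 0
        # Each rank owns out_features / tp_size output columns.
        self.weight = torch.nn.Parameter(torch.empty(out_features // tp_size, in_features))
        torch.nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = torch.nn.functional.linear(x, self.weight)   # local partial output
        # Gather every rank's partial output and concatenate along the feature dim.
        # (A trainable version would wrap this gather in a custom autograd function.)
        tp_size = dist.get_world_size(group=self.tp_group)
        gathered = [torch.empty_like(partial) for _ in range(tp_size)]
        dist.all_gather(gathered, partial, group=self.tp_group)
        return torch.cat(gathered, dim=-1)
```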
2.1.3 Pipeline parallelism
Pipeline parallelism splits the model according to its computation order. Each node is responsible for the computation of specific consecutive layers. During forward propagation, each node obtains the activation value from the previous node and uses it to compute local layers. After the local forward computation, it will send the activation to the next node. Each node obtains the gradient from the next node when backward propagation starts. After the backward computation, the gradients are sent to the previous node. In pipeline parallelism, each node stalls the computation until it receives the activations/gradients from its previous/next node, which causes time bubbles during training.
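A minimal sketch of this communication pattern for one micro-batch on an intermediate stage is shown below; the stage ranks, the tensor shape, and the blocking point-to-point calls are simplifying assumptions.

```python
# Minimal sketch of pipeline-parallel communication for one micro-batch on an
# intermediate stage; prev_rank and next_rank are the neighboring pipeline stages.
import torch
import torch.distributed as dist

def forward_step(stage_module, act_shape, prev_rank, next_rank, device):
    # Wait for the activation from the previous stage. The time spent blocked in
    # recv/send is exactly the "bubble" discussed in this paper.
    activation = torch.empty(act_shape, device=device)
    dist.recv(activation, src=prev_rank)
    activation.requires_grad_(True)            # needed later for the backward pass
    output = stage_module(activation)
    dist.send(output.detach(), dst=next_rank)  # hand the result to the next stage
    return activation, output
```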
2.2 Modern distributed training
At present, the training of large models, such as Pangu-α [18] and Megatron-LM [19], is performed in a hybrid parallel manner that combines the three aforementioned parallel modes.
Fig.1 is an example of hybrid parallelism in Megatron-LM [19]. In this example, we have 16 GPUs, denoted g0, g1, ..., g15. We use every 2 GPUs to construct a data parallel group, 2 GPUs to construct a tensor parallel group, and 4 GPUs to construct a pipeline parallel group. Thus we create 8 data parallel groups, 8 tensor parallel groups, and 4 pipeline parallel groups as:
8 data parallel groups: [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
8 tensor parallel groups: [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
4 pipeline parallel groups: [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]
During the training process, as Fig.1 presents, each GPU communicates within its own parallel groups: it performs all-reduce with the other GPU in its data parallel group to aggregate gradients (data parallelism), gathers tensors within its tensor parallel group when a split operator completes (tensor parallelism), and sends/receives activations and gradients with its neighboring GPUs in the pipeline parallel group (pipeline parallelism).
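A sketch of how such groups can be built with torch.distributed is given below. The sizes follow the example above; the helper is illustrative and is not Megatron-LM's initialization code.

```python
# Minimal sketch of building data-, tensor- and pipeline-parallel groups for
# 16 ranks with tensor size 2, pipeline size 4 and data size 2 (as above).
# Every process must execute all new_group calls in the same order.
import torch.distributed as dist

def build_groups(world_size: int = 16, tp: int = 2, pp: int = 4, dp: int = 2):
    assert world_size == tp * pp * dp
    tensor_groups, data_groups, pipeline_groups = [], [], []
    # Tensor-parallel groups: consecutive ranks, e.g., [g0, g1], [g2, g3], ...
    for start in range(0, world_size, tp):
        tensor_groups.append(dist.new_group(list(range(start, start + tp))))
    # Data-parallel groups: ranks holding the same model shard, e.g., [g0, g2].
    for pp_idx in range(pp):
        for tp_idx in range(tp):
            ranks = [pp_idx * tp * dp + dp_idx * tp + tp_idx for dp_idx in range(dp)]
            data_groups.append(dist.new_group(ranks))
    # Pipeline groups: one rank per stage, e.g., [g0, g4, g8, g12].
    for offset in range(tp * dp):
        pipeline_groups.append(dist.new_group([offset + stage * tp * dp for stage in range(pp)]))
    return data_groups, tensor_groups, pipeline_groups
```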
3 Bubble aware fault-tolerant strategy
Bubbles refer to the time a device sits idle during training. As mentioned earlier, in hybrid parallelism, each node must wait for the activation values or gradients sent from its predecessor or successor nodes before proceeding with computation after completing its local computation task. While waiting for the data to arrive, the device is stalled, which produces bubbles. These bubbles waste system resources, but we can use them for fault tolerance to mask the overhead of checkpointing, as Fig.2 presents.
Fig.2 Comparison of the two approaches to checkpointing. (a) Normal checkpointing timeline; (b) BAFT-style checkpointing timeline
3.1 Overhead of checkpoint saving
The earlier work CheckFreq [14] proposes a two-step checkpoint-saving approach that allows checkpoint saving to be performed asynchronously with training. Current research on checkpointing for deep neural network (DNN) training, such as Check-N-Run [15], also uses this approach to decouple the checkpointing process from training.
The basic CheckFreq-style checkpointing mechanism includes two phases, Snapshot and Persist. Snapshot creates a copy of the model in memory, and Persist asynchronously writes the snapshot to persistent storage, such as a hard disk or a remote file server. This two-step design makes full use of device performance in single-node training but does not exploit the features of distributed training. In BAFT's design, we implement a similar Snapshot to decouple checkpoint processing from training. As for Persist, we replace it with a Transfer phase that saves the checkpoint in the neighbor node's memory: the snapshot of each node is transferred to its neighbor node, which acts as a backup. The details and main overhead of these two phases are introduced as follows:
3.1.1 Snapshot
The goal of Snapshot is to decouple checkpointing from computation. In a distributed training environment, each node creates a snapshot of its local part of the model. The Snapshot must be performed atomically to ensure consistency; therefore, it is called between two weight-update steps. When there is enough spare GPU memory to hold the snapshot, the model parameters are copied within GPU memory (GPU snapshot); otherwise, they are copied directly to CPU memory (CPU snapshot).
The GPU snapshot is faster because it avoids copying data between CPU and GPU memory. The CPU snapshot consumes PCI-E bandwidth, which may already be occupied by sample loading at the start of an iteration.
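A minimal sketch of the two snapshot variants is shown below, assuming a PyTorch module and enough free memory on the chosen side; it is not BAFT's actual implementation.

```python
# Minimal sketch of the Snapshot phase: copy the current weights either into spare
# GPU memory or into pinned CPU memory, depending on what is available.
import torch

def snapshot(model: torch.nn.Module, use_gpu_snapshot: bool) -> dict:
    """Create a copy of the model state; call this between two weight updates."""
    snap = {}
    for name, tensor in model.state_dict().items():
        if use_gpu_snapshot:
            snap[name] = tensor.detach().clone()                  # stays on the GPU
        else:
            buf = torch.empty(tensor.shape, dtype=tensor.dtype,
                              device="cpu", pin_memory=True)
            buf.copy_(tensor.detach(), non_blocking=True)         # overlaps with compute
            snap[name] = buf
    if not use_gpu_snapshot and torch.cuda.is_available():
        torch.cuda.synchronize()   # make sure the async copies have completed
    return snap
```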
3.1.2 Transfer
Transfer in BAFT sends the snapshot of the model weights from local memory to the neighbor node's memory. This phase is invoked after the Snapshot has finished. Because it occupies communication bandwidth, it must avoid conflicting with the communication that already takes place during training. Transfer is a node-to-node function: it consists of a Send on the source device and a Receive on the destination device. Therefore, the receiver and the sender must schedule the communication and start the Transfer simultaneously to minimize the time spent waiting. If the Transfer and the other communication tasks are not scheduled well, the regular communication, such as the activation/gradient send/receive in pipeline parallelism and the tensor reduction in tensor parallelism, will be stalled while the Transfer is being processed.
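The sketch below illustrates the Transfer phase with non-blocking point-to-point calls; the partner ranks and the way the handles are awaited inside a bubble are assumptions, not BAFT's actual interfaces.

```python
# Minimal sketch of the Transfer phase: ship a snapshot to the neighbor node
# using non-blocking point-to-point communication scheduled into bubbles.
import torch.distributed as dist

def send_snapshot(snapshot: dict, dst_rank: int):
    """Sender side: issue an isend for every tensor of the snapshot."""
    return [dist.isend(tensor, dst=dst_rank) for tensor in snapshot.values()]

def recv_snapshot(buffers: dict, src_rank: int):
    """Receiver side: post matching irecv calls into preallocated buffers."""
    return [dist.irecv(buf, src=src_rank) for buf in buffers.values()]

# Both sides later call handle.wait() on the returned work objects inside a
# bubble, so the regular activation/gradient traffic is never delayed.
```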
In summary, the two steps of checkpointing consume CPU-GPU memory bandwidth and inter-device bandwidth, respectively, so they do not directly affect the training computation. Furthermore, if we properly schedule both the existing communication and the communication used for checkpointing, we can avoid its impact on overall training and thus realize low-cost checkpointing.
3.2 Bubbles in distributed training
In hybrid parallelism, bubbles mainly come from pipeline parallelism's communication between neighboring nodes. The communication required by pipeline parallelism is only the transmission of gradients and activation values, which involves little data volume. Therefore, most of the time each node spends after calling the communication API is spent waiting. To reduce time bubbles during training, pipeline parallelism divides each mini-batch into micro-batches, which enables different devices to work on different micro-batches simultaneously.
We measured the proportion of bubbles in hybrid parallel training under different batch sizes and parallel strategies. The hardware was a training cluster of 4 servers with 4 A40 GPUs each. In the evaluation, we use Megatron-LM, a well-optimized distributed training framework, to run the training task. Megatron-LM has its own strategy to partition the model according to the parallel configuration and leaves fewer bubbles than unoptimized frameworks. In addition, compared with frameworks for automated distributed training, such as Alpa [20] and GSPMD [21], Megatron-LM has higher throughput, which means fewer bubbles, ensuring that BAFT is tested in a more rigorous environment. Therefore, if we can use the bubbles left by Megatron-LM to mask the overhead of checkpointing, the same mechanism can be effectively plugged into other distributed training frameworks, which produce even more time bubbles.
To visualize the distribution of bubbles in distributed training, we tested different models with different parallel strategies on our cluster. Fig.3 shows the bubble percentage of each device in our evaluation. Regarding the configuration names, 4n4g denotes that our cluster consists of 4 servers with 4 GPUs each. The following string, such as d2m2p4, denotes the scale of each kind of parallelism. For instance, d2m2p4 means there are 2 GPUs in a data parallel group, 2 GPUs in a tensor parallel group, and 4 GPUs in a pipeline parallel group.
Fig.3 Bubble percentages of different configurations. (a) g128m8; (b) g64m8
It can be seen that, because each device holds a different part of the model and is responsible for different computing tasks, the proportion of bubble time varies across devices. In our experiment, bubble time accounts for at least 20% of the total training time.
For larger-scale distributed training, many previous efforts have attempted to reschedule the execution order of pipeline tasks in order to reduce bubbles; however, these efforts still cannot completely eliminate them. By rescheduling the execution order of tasks on each device, the bubble ratio can be significantly reduced compared to the original pipeline schedule. Chimera [22] conducted experiments on 2048 NVIDIA A100 GPUs training GPT-2, and the resulting bubble ratios are shown in Fig.4.
Fig.4 Bubble ratio of different kinds of pipeline parallelism in a large-scale cluster
In summary, the Snapshot + Transfer method of recording checkpoints on neighboring nodes requires network bandwidth between nodes. Training with pipeline parallelism introduces time bubbles during which the network bandwidth is largely idle, apart from transmitting activations or gradients. Therefore, we can make use of these time bubbles to perform neighbor-node checkpointing.
4 BAFT overview
BAFT is a distributed checkpointing framework for large-scale training tasks. While preserving model accuracy, BAFT exploits the characteristics of hybrid parallelism to design a new mechanism that records checkpoints with negligible overhead. BAFT implements chain replication, which means that each node saves its checkpoint on its neighbor node. For different types of errors, BAFT uses corresponding recovery schemes.

For a failure of a single GPU: because data parallelism exists in distributed training, the GPUs in the same data parallel group hold the same model weights. Therefore, for a single GPU, the other GPUs in its data parallel group provide a natural checkpoint of the model weights. For such failures, BAFT does not invoke any additional fault-tolerance mechanism during training and introduces no overhead.

For a failure of a server: considering the amount of data communication, GPUs on the same server are usually assigned to the same data parallel group to avoid large cross-server transfers that would affect training efficiency. Therefore, when a failure occurs on a server, all the GPUs on it become inaccessible, and the data redundancy provided by data parallelism is lost. To cover this kind of failure, BAFT saves checkpoints on a neighboring server during training to ensure that the checkpoints remain accessible. As the analysis in Section 3.2 shows, each device has a certain percentage of bubbles at training time, which allows us to exploit them to mask the overhead of the fault-tolerance mechanism.
In this section, we present an overview of BAFT, which consists of three phases: profiling runtime information, generating the checkpoint plan, and executing the checkpoint function, as Fig.5 presents.
Fig.5 System overview of BAFT
4.1 Profiling runtime information
Before starting checkpointing, BAFT collects the runtime information of each node during the first few iterations of training.
In pipeline parallelism, training can be divided into stages such as forward-compute, backward-compute, forward-send, forward-receive, backward-send, and backward-receive. In each iteration, the order and duration of each stage are fixed for each node. By identifying these stages, BAFT can find the idle time of each device during execution and mark it as a bubble. To avoid checkpointing interfering with the original communication, BAFT also records the communication volume between the local node and its neighbor node around each bubble. The information profiled in this phase is used later to generate the checkpoint plan.
Furthermore, during the profiling phase, the memory usage of each device at every step of the training process is recorded. When generating the checkpoint plan, BAFT chooses between GPU snapshots and CPU snapshots based on this memory usage.
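A simplified sketch of such profiling is shown below: each stage is timed for a few warm-up iterations and the uncovered gaps are treated as bubbles. The timer granularity and data layout are illustrative, not BAFT's internal structures (which build on Megatron-LM's timer).

```python
# Minimal sketch of runtime profiling: time each pipeline stage during a few
# warm-up iterations; gaps between recorded stages are idle time (bubbles).
# For GPU-accurate timing, torch.cuda.Event timestamps could be used instead.
import time

class StageProfiler:
    def __init__(self):
        self.events = []                          # (stage_name, start, end)

    def record(self, name, fn, *args, **kwargs):
        """Run one stage (e.g., forward-compute, forward-send) and time it."""
        start = time.perf_counter()
        out = fn(*args, **kwargs)
        self.events.append((name, start, time.perf_counter()))
        return out

    def bubbles(self):
        """Return (start, end) intervals in which the device was idle."""
        spans = sorted(self.events, key=lambda e: e[1])
        return [(prev_end, start)
                for (_, _, prev_end), (_, start, _) in zip(spans, spans[1:])
                if start > prev_end]
```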
4.2 Generating checkpoint plan
This phase generates the checkpoint plan. When the training process reaches a step preset in the checkpoint plan, BAFT determines whether to perform checkpointing at that time or to continue training.
The checkpoint plan includes:
4.2.1 Destination to accommodate the checkpoint
For each node, the checkpointing scheme needs to determine on which node the local data will be saved. We choose the previous node of the current node in the pipeline group. This choice is based on the observation that devices holding adjacent pipeline stages have bubbles that are more closely aligned in time, so there is a larger intersection of bubble time available for checkpointing.
4.2.2 Time to checkpoint
BAFT determines when to checkpoint based on the runtime information collected in the profiling phase. The Snapshot is generated immediately after the weight update, and the time to start the Transfer depends on the checkpoint plan generated by Algorithm 1.
For each pipeline stage, Algorithm 1 iterates over the bubbles of the local node and of its target node determined in Section 4.2.1. As Fig.6 shows, in order to avoid any influence on training, BAFT calculates the time required for regular communication according to the communication volume and network bandwidth and reserves the corresponding time at the start and the end of each bubble. Algorithm 1 produces a send plan and a receive plan, which contain the bubbles in which BAFT needs to call the send and receive functions, respectively.
Fig.6 Reserve the time for regular communication and pick the bubbles for Transfer
We maintain a counter that denotes the number of parameters that can be transferred by the bubbles selected so far. Every time a bubble is appended to the plan, the number of parameters it can carry, calculated from the bubble's duration and the bandwidth, is added to the counter, until the counter reaches the number of parameters that need to be transferred.
It is worth noting that if the bubbles within one iteration are not enough to transmit all the local parameters to the neighbor node, the function will also return bubbles from the following iterations. Accordingly, the granularity of checkpoint scheduling becomes $k$ iterations, where $k$ is the smallest number of iterations whose bubbles are sufficient for parameter checkpointing; in other words, a checkpoint is recorded every $k$ iterations. This allows us to mask the overhead of checkpointing completely. Since we transfer data from a previously taken snapshot, spreading checkpointing across multiple iterations does not cause inconsistency in the saved parameters.
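Algorithm 1 itself is not reproduced in this text, so the following sketch only conveys its core loop under simplified assumptions: bubbles are (start, duration) pairs, a fixed bandwidth converts time into a parameter budget, and a margin is reserved at both ends of each bubble for the regular send/receive traffic.

```python
# Simplified sketch in the spirit of Algorithm 1: pick bubbles, minus a reserved
# margin for regular communication, until their capacity covers the parameters
# that must be transferred to the neighbor node.

def plan_transfer(bubbles, params_to_send, bandwidth, reserve):
    """bubbles: list of (start_time, duration) in seconds;
    bandwidth: parameters per second; reserve: margin kept at each end of a bubble."""
    plan, budget = [], 0
    for start, duration in sorted(bubbles):
        usable = duration - 2 * reserve
        if usable <= 0:
            continue
        capacity = int(usable * bandwidth)
        plan.append((start + reserve, usable, capacity))
        budget += capacity
        if budget >= params_to_send:
            break
    # If the budget is still short, the caller extends the plan with the bubbles
    # of the following iterations, so one checkpoint may span several iterations.
    return plan, budget >= params_to_send
```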
4.2.3 Parameters to be transferred
In a pipeline parallel group and a tensor parallel group, each node holds a different part of the model weights. However, in the same data parallel group, all nodes hold the same weights. To minimize the amount of data that needs to be transferred, the data held by a data parallel group with $n$ nodes is divided equally into $n$ parts, and each node is responsible for checkpointing one of them.
Based on the previous analysis of bubbles, we already know the duration of each bubble that can be used to save data on the neighbor node. To improve bubble utilization, we need to transfer as much data as possible in each bubble. To maximize the data transferred during a bubble, BAFT uses the layer as the granularity and applies dynamic programming (Algorithm 2) to select the weights to be transmitted in each bubble. The number of parameters that can be transferred is determined by the bubble's duration and the bandwidth between nodes.
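Algorithm 2 is likewise not reproduced here; the sketch below shows a standard 0/1-knapsack dynamic program over layers that matches the description: choose whole layers whose total parameter count is as large as possible without exceeding the bubble's budget. Parameter counts would in practice be bucketed (e.g., into megabytes) to keep the table small.

```python
# Simplified sketch in the spirit of Algorithm 2: select layers whose total size
# is maximal but still fits the parameter budget of the current bubble
# (0/1 knapsack over per-layer parameter counts).

def select_layers(layer_sizes, budget):
    """layer_sizes: parameters per not-yet-backed-up layer; budget: bubble capacity."""
    best = [0] * (budget + 1)          # best[c]: max parameters packed into capacity c
    taken = [[False] * (budget + 1) for _ in layer_sizes]
    for i, size in enumerate(layer_sizes):
        for cap in range(budget, size - 1, -1):
            if best[cap - size] + size > best[cap]:
                best[cap] = best[cap - size] + size
                taken[i][cap] = True
    # Backtrack to recover which layers were chosen.
    selected, cap = [], budget
    for i in range(len(layer_sizes) - 1, -1, -1):
        if taken[i][cap]:
            selected.append(i)
            cap -= layer_sizes[i]
    return list(reversed(selected))
```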
4.3 Executing checkpoint function
When training reaches a bubble marked in the checkpoint plan, the corresponding time is used by Transfer to back up the local parameters to the neighbor node. When a node has received all the data from its neighbor, the checkpoint is marked as complete and will not be overwritten by other data until the next round of checkpointing succeeds. This ensures that consistent data is available for recovery even if a device crashes after transmitting only part of its parameters.
4.4 Error recovery
To address error recovery in distributed training, we developed a dedicated module that restores model parameters when a node fails. Within the BAFT framework, each node stores the model parameters of its predecessor node in the pipeline parallel group, so the saved parameters can be retrieved when a communication error occurs. If communication with the predecessor node times out, the local node examines the data parallel group to which the GPUs on the predecessor node's server belong. It then selects one GPU within the same data parallel group as the recipient of the backup model parameters. Once the corresponding model parameters have been sent to the selected GPU, they are broadcast within that data parallel group, effectively restoring the model parameters of all GPUs on the server. This transmit-then-broadcast approach avoids sending duplicate data across servers and exploits the high intra-server bandwidth for broadcasting, improving the overall efficiency of data recovery.
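The sketch below outlines the two communication steps of this recovery path; rank discovery, timeout detection, and group handles are placeholders rather than BAFT's actual interfaces.

```python
# Minimal sketch of the transmit-then-broadcast recovery described above.
import torch.distributed as dist

def send_backup(backup_snapshot: dict, target_rank: int) -> None:
    """Checkpoint holder: push the backed-up weights of the failed stage to one
    healthy GPU of that stage's data-parallel group (a single cross-server copy)."""
    for tensor in backup_snapshot.values():
        dist.send(tensor, dst=target_rank)

def restore_within_group(buffers: dict, source_rank: int, dp_group) -> None:
    """Ranks of that data-parallel group: re-broadcast the received weights
    locally, exploiting the higher intra-server bandwidth."""
    for buf in buffers.values():
        dist.broadcast(buf, src=source_rank, group=dp_group)
```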
5 Evaluation
5.1 BAFT implementation
We implement BAFT with PyTorch v1.12.0. Communication between nodes is performed through the APIs provided by torch.distributed. We adapted BAFT to Megatron-LM and took advantage of its scheduler to plug the checkpointing phases into regular training. Furthermore, we modified Megatron-LM's timer to implement the runtime profiling described in Section 4.1.
5.2 Experimental set-up
To assess the impact of BAFT on training, we conducted experiments on our GPU cluster, which comprises four servers, each equipped with four NVIDIA A40 GPUs. We leveraged all 16 GPUs to perform the training tasks. In our experiments, the 4 GPUs within the same server are interconnected with PCI-E 4.0 x16 links, providing a bandwidth of 31.5 GB/s between GPUs. Communication among 4 servers is achieved using a 10 Gbps network. The models used in our evaluation include GPT-2 [
23], Bert-large, and Bert-exlarge [
24], as presented in Tab.1, with multiple configurations to evaluate the performance of BAFT. The first number in the batch size indicates the global batch size, while the second denotes the mini-batch size. The WikiText [
25] dataset was utilized as input for training the models.
Tab.1 Training configurations

Models | Batch size | Parallel strategy
Bert-Large, Bert-Exlarge, GPT-2 | (128, 8), (64, 8) | d2m2p4, d4m1p4, d1m4p4, d2m1p8, d1m2p8
Tab.2 Comparison of related work

Name | Supports different models | Supports distributed training | Time overhead | No effect on convergence
CheckFreq [14] | √ | × | Low | √ |
Deepfreeze [27] | √ | √ | Low | × |
FTPipeHD [26] | √ | √ | High | √ |
ResPipe [28] | √ | √ | High | √ |
Check-N-Run [15] | × | √ | Low | × |
CPR [29] | √ | √ | Low | × |
BAFT | √ | √ | Low | √ |
5.3 Training performance
To assess the practical performance of BAFT during training, we trained various models using the parallel configurations listed in Tab.1 and observed their performance under three conditions: no fault tolerance, fault tolerance using BAFT, and fault tolerance using FTPipeHD [26], a fault-tolerance framework designed for pipeline parallelism. Each test ran 1000 iterations, and we report the average iteration time. No fault tolerance means that no backup was performed during training. Furthermore, we implemented the fault-tolerant mechanism used by FTPipeHD according to the characteristics of pipeline parallelism and compared it with BAFT. The bar chart in Fig.7 shows the time required for one iteration under the different configurations, while the line chart depicts the percentage of additional overhead incurred by BAFT compared to the no-fault-tolerance condition. The figure indicates that with BAFT enabled, the time overhead of each iteration was no more than 2% compared to the no-fault-tolerance condition, whereas the backup method used by FTPipeHD, which does not exploit bubbles, led to significant time overhead.
Fig.7 Comparison of checkpointing overhead between BAFT and FTPipeHD. (a) Batch size (128,8); (b) batch size (64,8)
In our experiments, we tried various parallel configurations on the same training cluster. This effectively altered the communication method between adjacent nodes in the pipeline group: in some configurations, neighboring nodes communicated over local PCI-E, while in others, communication occurred over the network. The experimental results indicate that the communication method had minimal impact on the performance of BAFT.
In fact, the communication method does not affect the number of parameters that need to be checkpointed, so the proportion of a node's checkpoint traffic within its total communication volume remains constant. This ensures that, regardless of how a node communicates with its neighbors, BAFT keeps the introduced overhead at an acceptable proportion of the total training time.
5.4 Time loss of error recovery
As mentioned in Section 1, the time loss mainly depends on the checkpoint interval $T_{\text{ckpt}}$. In our evaluation, under the premise that the overhead of checkpointing can be completely hidden, we chose the highest possible checkpointing frequency. In most configurations, the checkpoint interval is one or two iterations, and in the extreme case three iterations. As a result, only a few iterations, i.e., only several seconds of training, can be lost.
Across all configurations tested in our experiment, recording checkpoints on neighboring nodes was completed in only 1 to 3 iterations. Consequently, in the event of an error, the training time lost would only be the time spent on these 1 to 3 iterations, equivalent to approximately 0.6 to 5.5 s.
In addition, because the checkpoints used in error recovery are kept on neighbor nodes, the failed node only spends several seconds loading the checkpoint, so the time cost of error recovery is greatly reduced. Fig.8 shows the time needed to reload the checkpoint from a neighbor node in our experiments.
Fig.8 Time cost to reload checkpoint
6 Related work
Checkpoint-based error recovery has been widely studied, with much of the work aiming to mask the overhead of checkpointing and alleviate its impact on training. CheckFreq [14] dynamically tunes the checkpoint frequency depending on the model and the training environment. Furthermore, it divides the checkpointing procedure into two steps: copy the model state in memory, then write it to persistent storage. By pipelining these two steps, CheckFreq significantly shortens the computation stall caused by checkpointing. However, CheckFreq is not optimized for the characteristics of distributed training, so it cannot leverage training bubbles to hide checkpointing overhead. Deepfreeze [27] uses lightweight, storage-specific APIs to minimize I/O operations. It also uses sharding, exploiting the features of data parallelism, to reduce the number of parameters that need to be checkpointed. Despite numerous optimizations of the checkpointing process, it still incurs additional overhead and impacts training efficiency because it does not exploit idle resources during training. FTPipeHD [26] and ResPipe [28] implement checkpointing for pipeline parallelism by saving the model state on neighbor nodes. They focus on ensuring checkpoint availability, i.e., that the parameters needed for recovery can be retrieved in case of failure, but they lack performance optimizations: both schemes incur nearly 50% overhead in the iteration in which the checkpointing function is called. Check-N-Run [15] compresses checkpoints by means of quantization, which reduces the backup overhead, and reduces computation stalls by decoupling checkpointing from training. Exploiting the way parameters are updated in recommendation-model training, Check-N-Run and CPR [29] use differential checkpointing and partial recovery, respectively, to reduce the system load.
In summary, as Tab.2 shows, existing fault-tolerant schemes for distributed training are either not optimized for parallel strategies, resulting in significant overhead, or only support specific models.
7 Conclusion
This paper introduces BAFT, a checkpointing-based fault-tolerant framework designed for distributed training. BAFT is a full-recovery framework, so it avoids the stale-weight problem introduced by partial recovery. By leveraging the characteristics of hybrid parallel training, BAFT optimizes the checkpointing mechanism to minimize overhead, so that the fault-tolerant framework does not affect training efficiency.
In practice, BAFT first profiles several iterations to determine the duration of each step within an iteration and generates a checkpointing plan that leverages bubbles to mask overhead. We integrated BAFT into Megatron-LM and evaluated its performance by training models with different parallel strategies and batch sizes. Our results indicate that BAFT can complete model checkpoint recording with less than 1% overhead. Although initially implemented in Megatron-LM, the mechanism of BAFT can be easily adapted to other distributed training frameworks. Given the widespread use of pipeline parallelism in training large models, enabling BAFT during training is a straightforward process.