1 Introduction
2 Background
2.1 Resilience problem of HPC systems
Tab.1 The MTBF of different HPC systems [7]
System | MTBF(hours) | Cores | Nodes |
---|---|---|---|
Jaguar XT4 | 36.91 | 31,328 | 7,832 |
Jaguar XT5 | 22.67 | 149,504 | 18,688 |
Jaguar XK6 | 8.93 | 298,592 | 18,688 |
Titan XK7 | 14.51 | 560,640 | 18,688 |
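
As a rough reading of Tab.1, a system-level MTBF can be related to a per-node MTBF. The sketch below assumes that nodes fail independently with exponentially distributed failure times, so that system MTBF ≈ node MTBF / number of nodes. This failure model and the function name `implied_node_mtbf_years` are illustrative assumptions and are not part of the cited data.

```python
HOURS_PER_YEAR = 24 * 365

def implied_node_mtbf_years(system_mtbf_hours, nodes):
    """Per-node MTBF implied by a system MTBF, assuming independent,
    exponentially distributed node failures (an illustrative model only)."""
    return system_mtbf_hours * nodes / HOURS_PER_YEAR

# (MTBF in hours, node count) taken from Tab.1.
systems = {
    "Jaguar XT4": (36.91, 7832),
    "Titan XK7": (14.51, 18688),
}

for name, (mtbf_hours, nodes) in systems.items():
    years = implied_node_mtbf_years(mtbf_hours, nodes)
    print(f"{name}: implied node MTBF ~ {years:.1f} years")
```

Under this simplified model, each node appears individually reliable, and the short system-level MTBF follows mainly from the number of nodes.
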
2.2 Malfunctions of HPC systems
Tab.2 The terminology of HPC malfunctions
Concept | Explanation |
---|---|
Fault | Root cause of an error, usually physical defects or software bugs |
Error | Deviation from the expected result |
Failure | The system fails to deliver correct service |
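Tab.3 Abnormal states of HPC systems
Class | Meaning | Typical examples | Explanation |
---|---|---|---|
Fail-stop failure | Hardware and/or software stop working | Kernel panic | Kernel error from which the operating system cannot quickly recover. |
Fail-stop failure | Hardware and/or software stop working | Node heartbeat fault | Exception when receiving the heartbeat from other nodes. |
Fail-stop failure | Hardware and/or software stop working | Traps | Segmentation faults, invalid-opcode traps. |
Fail-stop failure | Hardware and/or software stop working | GFS failure | Failure of the global file system. |
Fail-stop failure | Hardware and/or software stop working | Scheduler | Internal bugs of the job scheduler. |
Fail-stop failure | Hardware and/or software stop working | Acc failure | Failure of accelerators or co-processors. |
Fail-stop failure | Hardware and/or software stop working | Storage failure | Storage system fails to work. |
Fail-stop failure | Hardware and/or software stop working | Node hardware failure | Node fails due to power or cooling-system errors, damaged hardware components, etc. |
Fail-stop failure | Hardware and/or software stop working | Interconnect congestion | Network connection is congested. |
Soft error / Fail-continue error | System still works, but the execution of the application is incorrect | SDC | Undetected silent data corruption. |
Soft error / Fail-continue error | System still works, but the execution of the application is incorrect | CFE | Control flow error. |
Soft error / Fail-continue error | System still works, but the execution of the application is incorrect | MCE | Machine check exception. |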
3 Classification of resilience approaches
Tab.4 Classification of typical resilience approaches
Resilience method | Checkpointing | Replication | Soft error resilience | ABFT | Fault detection and prediction |
---|---|---|---|---|---|
Redundant data | System memory or application data space | Process data and messages | N/A | Checksum of algorithm | N/A |
Recovery method | Failure-rollback | Forward recovery | Error-restart | Error-restart | N/A |
Overhead/cost | Medium | High | Medium | Low | Low |
Generality | Systems and applications | Systems and applications | Systems and applications | Applications | Systems and applications |
Ease of use or deployment | Easy | Easy | Hard | Hard | Medium |
Limitation | Scalability | Resource consumption and scalability | Soft errors only | Algorithm-dependent | Relies on other recovery methods |
4 Checkpointing
Tab.5 Comparison of different checkpointing levels
Checkpointing level | System-level | User-level | Application-level |
---|---|---|---|
Explanation | The operating system is in charge of checkpointing. | A user-level library is responsible for checkpointing and is linked to applications. | The application itself is in charge of checkpointing. |
Typical systems | BLCR [21] | DMTCP [22] | FTI [23] |
Checkpointing data | Status of entire system | Status of entire application | User-specified application status |
Overhead | High | Medium | Low |
Transparency | Transparent to applications | Application needs to be loaded or linked with checkpoint library | Application needs to be modified |
Portability | Low | Medium | High |
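
The application-level column of Tab.5 can be illustrated with a minimal sketch in which the application itself decides what to save (here only a loop index and a partial sum) and when to save it. This is not the API of FTI or of any other listed system; the function names `save_checkpoint`, `load_checkpoint`, and `run`, the pickle file format, and the `fail_at` parameter used to simulate a fail-stop failure are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write the user-selected state to a temp file, then rename it into place
    so that a crash during the write does not corrupt the last checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing only the loop index and partial sum.
    On start, roll back to the last checkpoint if one is present."""
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same path after a simulated failure resumes from the most recent checkpoint rather than from step 0, which is the failure-rollback behaviour. Because only the user-specified state is written, the checkpoint is small, at the price of modifying the application.
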
4.1 System-level checkpointing
4.2 User-level checkpointing
4.3 Application-level checkpointing
4.4 Heterogeneous checkpointing
4.5 Multi-level checkpointing
5 Replication
6 Soft error resilience
6.1 Control flow error
6.2 Silent data corruption
Tab.6 Software solutions for the SDC challenge
Approach | Advantages | Disadvantages |
---|---|---|
Checkpoint/restart | No hardware features required; little or no program modification | Requires large storage space and incurs high time overhead |
Replication | Simple and straightforward | High overhead in both running time and computing resources |
ABFT | Low overhead | Requires program code modifications; poor portability |
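
The ABFT row can be illustrated with the classic checksum-encoded matrix multiplication, offered here as a minimal sketch and not necessarily the specific scheme of any surveyed work. A checksum row is appended to A and a checksum column to B; the product then carries its own row and column checksums, and a corrupted element shows up as a mismatch. The helper names and the tolerance `tol` are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_column_checksum(a):
    """Append a row holding the sum of each column of A."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append a column holding the sum of each row of B."""
    return [row + [sum(row)] for row in b]

def checksums_consistent(c_full, tol=1e-9):
    """Check that the last row and column of the full product equal
    the column and row sums of its data part."""
    data = [row[:-1] for row in c_full[:-1]]
    rows_ok = all(abs(sum(row) - full[-1]) <= tol
                  for row, full in zip(data, c_full[:-1]))
    cols_ok = all(abs(sum(col) - chk) <= tol
                  for col, chk in zip(zip(*data), c_full[-1][:-1]))
    return rows_ok and cols_ok

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
print(checksums_consistent(c_full))   # checksums agree on a clean product

c_full[0][1] += 1                     # inject a silent data corruption
print(checksums_consistent(c_full))   # the mismatch exposes the corruption
```

The extra work is one additional row and column in the multiplication, which is why the overhead is low, but the encoding is tied to the algebra of matrix multiplication, which is why the approach is algorithm-dependent and requires code changes.
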