FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing

Wei HU, Guang-ming LIU, Yan-huang JIANG

Front. Inform. Technol. Electron. Eng, 2018, Vol. 19, Issue 10: 1273-1290. DOI: 10.1631/FITEE.1601450

Research

Abstract

As supercomputers rapidly grow in scale, reliability problems come to dominate system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively solve this problem. To address this issue, we present FTRP, a new fault tolerance framework using process replication and prefetching, which combines the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve application execution efficiency. The cost model, called the ‘work-most’ (WM) model, makes runtime decisions that adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. We observe, for the first time, a failure locality phenomenon in supercomputers that is analogous to program locality. Based on failure locality, the new proactive mechanism combines process replication with process prefetching, significantly reducing the losses caused by failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that FTRP outperforms existing fault tolerance mechanisms, improving application efficiency by up to 10% at common failure prediction accuracies, and that it is effective for petascale systems and beyond.

Keywords

High-performance computing / Proactive fault tolerance / Failure locality / Process replication / Process prefetching

Cite this article

Wei HU, Guang-ming LIU, Yan-huang JIANG. FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing. Front. Inform. Technol. Electron. Eng, 2018, 19(10): 1273-1290. DOI: 10.1631/FITEE.1601450

RIGHTS & PERMISSIONS

Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature
