Iaso: an autonomous fault-tolerant management system for supercomputers

Kai LU, Xiaoping WANG, Gen LI, Ruibo WANG, Wanqing CHI, Yongpeng LIU, Hongwei TANG, Hua FENG, Yinghui GAO

Front. Comput. Sci., 2014, Vol. 8, Issue 3: 378-390.

DOI: 10.1007/s11704-014-3503-1
RESEARCH ARTICLE

Abstract

As system scale increases, the inherent reliability of supercomputers steadily declines. The cost of fault handling and task recovery is rising so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue, referred to as the “reliability wall”, is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, in the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers is reduced from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
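The four lifecycle stages named in the abstract (fault detection, fault diagnosis, fault isolation, task recovery) can be illustrated with a toy sketch. This is not Iaso's implementation, which the abstract does not describe; the names `Fault`, `Stage`, `ROOT_CAUSES`, and `handle_fault`, the symptom strings, and the rescheduling rule are all hypothetical and chosen only to make the ordering of the stages concrete.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    DETECTED = auto()
    DIAGNOSED = auto()
    ISOLATED = auto()
    RECOVERED = auto()


@dataclass
class Fault:
    node: str                     # node on which the fault was observed
    symptom: str                  # raw symptom reported by a monitor
    history: list = field(default_factory=list)


# Hypothetical symptom -> root-cause table; not taken from the paper.
ROOT_CAUSES = {"ecc_error": "memory", "link_down": "interconnect"}


def handle_fault(fault, healthy_nodes, tasks):
    """Walk one fault through the four stages without operator input.

    healthy_nodes: set of schedulable node names (mutated in place).
    tasks: dict mapping task name -> node it runs on (mutated in place).
    Returns the diagnosed root cause as a string.
    """
    # 1. Detection: the fault has been reported to the manager.
    fault.history.append(Stage.DETECTED)

    # 2. Diagnosis: map the symptom to a probable root cause.
    cause = ROOT_CAUSES.get(fault.symptom, "unknown")
    fault.history.append(Stage.DIAGNOSED)

    # 3. Isolation: stop scheduling work on the faulty node.
    healthy_nodes.discard(fault.node)
    fault.history.append(Stage.ISOLATED)

    # 4. Recovery: move affected tasks to a remaining healthy node.
    for task, node in list(tasks.items()):
        if node == fault.node:
            if not healthy_nodes:
                raise RuntimeError("no healthy node left for " + task)
            tasks[task] = min(healthy_nodes)  # deterministic pick for the sketch
    fault.history.append(Stage.RECOVERED)
    return cause


nodes = {"cn0", "cn1", "cn2"}
tasks = {"job-a": "cn1", "job-b": "cn2"}
fault = Fault(node="cn1", symptom="ecc_error")
cause = handle_fault(fault, nodes, tasks)
```

In this sketch every stage runs in sequence with no human in the loop, which is the sense of "autonomous management" used in the abstract; a real system would of course need monitoring, checkpointing, and scheduler integration that are out of scope here.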

Keywords

supercomputer / autonomous management / fault tolerant / fault management / MilkyWay-2 system

Cite this article

Kai LU, Xiaoping WANG, Gen LI, Ruibo WANG, Wanqing CHI, Yongpeng LIU, Hongwei TANG, Hua FENG, Yinghui GAO. Iaso: an autonomous fault-tolerant management system for supercomputers. Front. Comput. Sci., 2014, 8(3): 378-390. DOI: 10.1007/s11704-014-3503-1

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag Berlin Heidelberg
