Software approaches for resilience of high performance computing systems: a survey

Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN

Front. Comput. Sci., 2023, Vol. 17, Issue 4: 174105. DOI: 10.1007/s11704-022-2096-3
Architecture
REVIEW ARTICLE



Abstract

With the scaling-up of high-performance computing (HPC) systems in recent years, their reliability has been declining continuously. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; then major approaches and techniques are introduced, including checkpointing, replication, soft-error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, challenges posed by system scale and heterogeneous architectures are discussed.
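To give a flavor of one of the surveyed techniques, algorithm-based fault tolerance (ABFT) augments a computation's data with checksums so that errors can be detected from the result itself. The following NumPy sketch illustrates the classic checksum idea for matrix multiplication; it is an illustrative example only (the function name `abft_matmul` is ours, not from the survey):

```python
import numpy as np

def abft_matmul(A, B):
    """Checksum-protected matrix multiply (ABFT sketch).

    Appends a column-checksum row to A and a row-checksum column to B,
    multiplies the augmented matrices, then verifies that the checksum
    row/column of the product match the sums of the result.
    """
    Ac = np.vstack([A, A.sum(axis=0)])                 # (m+1) x k
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])  # k x (n+1)
    C_full = Ac @ Br                                   # (m+1) x (n+1) full-checksum product
    C = C_full[:-1, :-1]                               # the actual product A @ B
    # An error in C would break the agreement between the stored checksums
    # and the recomputed row/column sums of C.
    row_ok = np.allclose(C_full[-1, :-1], C.sum(axis=0))
    col_ok = np.allclose(C_full[:-1, -1], C.sum(axis=1))
    return C, row_ok and col_ok
```

A single corrupted element of `C` changes one row sum and one column sum, so the mismatching checksum row and column even locate the faulty entry, which is what makes ABFT attractive compared with re-execution.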

Keywords

resilience / high-performance computing / fault tolerance / challenge

Cite this article

Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN. Software approaches for resilience of high performance computing systems: a survey. Front. Comput. Sci., 2023, 17(4): 174105 https://doi.org/10.1007/s11704-022-2096-3

Jie Jia is a PhD candidate in the School of Computer Science and Engineering, Beihang University, China. She is currently working on fault tolerance for large-scale parallel applications. Her research interests include high-performance computing, checkpointing, and distributed and parallel computing.

Yi Liu is a professor in the School of Computer Science and Engineering and Director of the Sino-German Joint Software Institute (JSI) at Beihang University, China. He received his PhD from the Department of Computer Science, Xi'an Jiaotong University, China, in 2000. His research interests include computer architecture, HPC, and next-generation network technologies.

Guozhen Zhang received his PhD from the School of Computer Science and Engineering, Beihang University, China. He is currently working on program debugging and fault tolerance for large-scale parallel applications. His research interests include HPC, computer architecture, and distributed and parallel computing.

Yulin Gao received his master's degree from the School of Computer Science and Engineering, Beihang University, China. His research interests include HPC and fault tolerance.

Depei Qian is a professor at the School of Computer Science and Engineering, Beihang University, China. He received his master's degree from the University of North Texas, USA, in 1984. He is an academician of the Chinese Academy of Sciences and a fellow of the China Computer Federation. His research interests include innovative technologies in distributed computing, high-performance computing, and computer architecture.


Acknowledgements

The research presented in this paper was supported by GHFund A (No. ghfund202107010337).

RIGHTS & PERMISSIONS

© 2023 Higher Education Press