Software approaches for resilience of high performance computing systems: a survey

Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN

Front. Comput. Sci. ›› 2023, Vol. 17 ›› Issue (4): 174105 ›› DOI: 10.1007/s11704-022-2096-3

Architecture · REVIEW ARTICLE

Abstract

With the scaling up of high-performance computing (HPC) systems in recent years, their reliability has been declining continuously. System resilience has therefore been regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; we then introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, challenges posed by system scale and heterogeneous architectures are discussed.
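Of the surveyed approaches, checkpointing is the most widely deployed: the application periodically saves its state to stable storage and, after a failure, restarts from the last saved state instead of from the beginning. A minimal application-level sketch of the idea follows; the function names, state layout, and file name are illustrative, not taken from any system the survey covers.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Write to a temporary file and rename atomically, so a crash
    # mid-write cannot corrupt the previous checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Return the last saved state, or None if no checkpoint exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None

def run(ckpt="sum.ckpt", steps=1000, interval=100):
    # Resume from the checkpoint if one exists; otherwise start fresh.
    state = load_checkpoint(ckpt) or {"i": 0, "total": 0}
    for i in range(state["i"], steps):
        state["total"] += i
        state["i"] = i + 1
        if state["i"] % interval == 0:
            save_checkpoint(ckpt, state)
    return state["total"]
```

After a failure, rerunning `run()` with the same checkpoint path replays only the iterations after the last saved state; the checkpoint interval trades recovery time against I/O overhead, the classic tuning problem discussed throughout the checkpointing literature.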

Keywords

resilience / high-performance computing / fault tolerance / challenge
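Among these keyword topics, fault tolerance includes algorithm-based fault tolerance (ABFT), which encodes checksums into the data of a numerical algorithm so that errors can be detected, and often corrected, from the algebra itself. The sketch below follows the spirit of the classic row/column-checksum scheme for matrix multiplication; it is a pure-Python illustration with made-up function names, using integer matrices to sidestep floating-point tolerance issues.

```python
def matmul(A, B):
    # Plain triple-loop matrix multiplication on lists of lists.
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def with_row_checksum(A):
    # Append a checksum row holding the column sums of A.
    return A + [[sum(col) for col in zip(*A)]]

def with_col_checksum(B):
    # Append a checksum column holding the row sums of B.
    return [row + [sum(row)] for row in B]

def detect_and_correct(C):
    # C is (n+1) x (p+1): the data block plus a checksum row and column,
    # produced by multiplying the checksum-augmented factors. A single
    # corrupted data element shows up as exactly one inconsistent row
    # checksum and one inconsistent column checksum; their intersection
    # locates the element, and the row checksum recomputes it.
    n, p = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(n) if sum(C[i][:p]) != C[i][p]]
    bad_cols = [j for j in range(p)
                if sum(C[i][j] for i in range(n)) != C[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i][j] = C[i][p] - sum(C[i][k] for k in range(p) if k != j)
    return C
```

The appeal of ABFT, as the survey notes, is that the checksums are carried through the computation itself: multiplying the row-checksum-augmented `A` by the column-checksum-augmented `B` yields a fully checksummed product at a small relative overhead, with no checkpoint I/O.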

Cite this article

Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN. Software approaches for resilience of high performance computing systems: a survey. Front. Comput. Sci., 2023, 17(4): 174105. DOI: 10.1007/s11704-022-2096-3



RIGHTS & PERMISSIONS

Higher Education Press

Supplementary files

FCS-22096-OF-JJ_suppl_1