Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directions

Zhengxiong HOU, Hong SHEN, Xingshe ZHOU, Jianhua GU, Yunlan WANG, Tianhai ZHAO

Front. Comput. Sci. ›› 2022, Vol. 16 ›› Issue (5) : 165107. DOI: 10.1007/s11704-022-0625-8
Architecture
REVIEW ARTICLE

Abstract

High-performance computing (HPC) clusters are increasingly popular, and large volumes of job logs recording many years of operation traces have been accumulated. At the same time, the HPC cloud makes it possible to access HPC services remotely. To execute applications, both HPC end-users and cloud users must themselves request specific resources for different workloads. Because users are usually unfamiliar with the hardware details and software layers, as well as the performance behavior of the underlying HPC systems, it is hard for them to select optimal resource configurations in terms of performance, cost, and energy efficiency. Hence, how to provide on-demand services with intelligent resource allocation is a critical issue in the HPC community. Prediction of job characteristics plays a key role in intelligent resource allocation. This paper presents a survey of existing work on, and future directions for, the prediction of job characteristics for intelligent resource allocation in HPC systems. We first review existing techniques for obtaining performance and energy consumption data of jobs. We then survey techniques for single-objective predictions of runtime, queue time, power and energy consumption, cost, and optimal resource configuration for input jobs, as well as multi-objective predictions. We conclude by discussing future trends, research challenges, and possible solutions towards intelligent resource allocation in HPC systems.

Keywords

high-performance computing / performance prediction / job characteristics / intelligent resource allocation / cloud computing / machine learning

Cite this article

Zhengxiong HOU, Hong SHEN, Xingshe ZHOU, Jianhua GU, Yunlan WANG, Tianhai ZHAO. Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directions. Front. Comput. Sci., 2022, 16(5): 165107 https://doi.org/10.1007/s11704-022-0625-8

Zhengxiong Hou received the PhD degree in computer science and technology from Northwestern Polytechnical University, China. He is an associate professor at the Center for High-Performance Computing, School of Computer Science, Northwestern Polytechnical University, China. His research interests include intelligent resource management and job scheduling, and performance optimization in HPC clusters and clouds.

Hong Shen received the BEng degree from the Beijing University of Science and Technology, the MEng degree from the University of Science and Technology of China, and the PhLic and PhD degrees from Abo Akademi University, Finland, all in computer science. He is currently a specially-appointed professor at Sun Yat-Sen University, China. He was a professor (Chair) of computer science at the University of Adelaide, Australia. With main research interests in parallel and distributed computing, algorithms, and high performance networks, he has published more than 300 papers, including more than 100 papers in international journals such as a variety of IEEE and ACM transactions.

Xingshe Zhou received the BS and MS degrees in computer science from Northwestern Polytechnical University, China. He is a professor with the School of Computer Science, Northwestern Polytechnical University, China. He was the dean and director of the Center for High-Performance Computing of this university. His research interests include embedded computing and distributed computing. He has published more than 100 papers in international journals and conferences.

Jianhua Gu received the PhD degree in computer science and engineering from Northwestern Polytechnical University, China. He is a professor at the Center for High-Performance Computing, School of Computer Science, Northwestern Polytechnical University, China. His research interests include operating systems and cloud computing.

Yunlan Wang received the PhD degree in computer science from Xi’an Jiaotong University, China. She is an associate professor at the Center for High-Performance Computing, School of Computer Science, Northwestern Polytechnical University, China. Her research interests include high-performance computing and data mining.

Tianhai Zhao received the PhD degree in computer science from Xi’an Jiaotong University, China. He is a lecturer at the Center for High-Performance Computing, School of Computer Science, Northwestern Polytechnical University, China. His research interests include parallel computing and cloud computing.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments and suggestions. We also thank Dr./Prof. Feng Zhang from Renmin University of China and Dr./Prof. Jidong Zhai from Tsinghua University, China for their helpful suggestions and discussions. This work was partly supported by the National Key R&D Program of China (2018YFB0204100), the Science & Technology Innovation Project of Shaanxi Province (2019ZDLGY17-02), and the Fundamental Research Funds for the Central Universities.

RIGHTS & PERMISSIONS

2022 Higher Education Press