GCSS: a global collaborative scheduling strategy for wide-area high-performance computing
Yao SONG, Limin XIAO, Liang WANG, Guangjun QIN, Bing WEI, Baicheng YAN, Chenhao ZHANG
Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its abundant computing and storage resources. However, the geographical distribution of these resources makes efficient task distribution and data placement more challenging. To achieve higher system performance, this study proposes a two-level global collaborative scheduling strategy for wide-area high-performance computing environments. The strategy integrates lightweight solution selection, redundant data placement, and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. Experimental results indicate that, compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73%, respectively, and incurs similar global data migration costs.
high-performance computing / scheduling strategy / task scheduling / data placement
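The task stealing mechanism named in the abstract is not specified in detail here, so the following is only a minimal sketch of generic work stealing between sites, not the GCSS algorithm itself. The function name `steal_tasks`, the `threshold` parameter, the "steal half from the most loaded site" rule, and the site and task names are all assumptions made for illustration.

```python
from collections import deque


def steal_tasks(queues, threshold=1):
    """Let each idle site steal half the queue of the most loaded site.

    Generic work-stealing illustration (hypothetical policy, not GCSS).
    `queues` maps a site name to a deque of pending task ids.
    Returns a list of (task, victim_site, thief_site) moves.
    """
    moves = []
    for thief, thief_queue in queues.items():
        if thief_queue:  # only idle sites steal
            continue
        # Pick the site with the longest pending queue as the victim.
        victim = max(queues, key=lambda site: len(queues[site]))
        pending = len(queues[victim])
        if pending <= threshold:  # nothing worth stealing
            continue
        for _ in range(pending // 2):
            task = queues[victim].pop()  # take from the tail of the victim
            thief_queue.append(task)
            moves.append((task, victim, thief))
    return moves


# Hypothetical example: three sites, one of them idle.
queues = {
    "site_a": deque(["t1", "t2", "t3", "t4"]),
    "site_b": deque(),
    "site_c": deque(["t5"]),
}
moves = steal_tasks(queues)
```

In this toy run the idle `site_b` takes the last two tasks from `site_a`. A wide-area scheduler would presumably also weigh the cost of moving each task's input data before stealing, which this sketch ignores.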