Understanding co-run performance on CPU-GPU integrated processors: observations, insights, directions
Qi ZHU, Bo WU, Xipeng SHEN, Kai SHEN, Li SHEN, Zhiying WANG
Understanding co-run performance on CPU-GPU integrated processors: observations, insights, directions
Recent years have witnessed a processor development trend that integrates central processing unit (CPU) and graphic processing unit (GPU) into a single chip. The integration helps to save some host-device data copying that a discrete GPU usually requires, but also introduces deep resource sharing and possible interference between CPU and GPU. This work investigates the performance implications of independently co-running CPU and GPU programs on these platforms. First, we perform a comprehensive measurement that covers a wide variety of factors, including processor architectures, operating systems, benchmarks, timing mechanisms, inputs, and power management schemes. These measurements reveal a number of surprising observations.We analyze these observations and produce a list of novel insights, including the important roles of operating system (OS) context switching and power management in determining the program performance, and the subtle effect of CPU-GPU data copying. Finally, we confirm those insights through case studies, and point out some promising directions to mitigate anomalous performance degradation on integrated heterogeneous processors.
performance analysis / GPGPU / co-run degradation / fused processor / program transformation
[1] |
Markatos E P, LeBlanc T J. Using processor affinity in loop scheduling on shared-memory multiprocessors. IEEE Transactions on Parallel Distributed Systems, 1994, 5(4): 379–400
CrossRef
Google scholar
|
[2] |
Squillante M S, Lazowska E D. Using processor-cache affinity information in shared-memory multiprocessor scheduling. IEEE Transactions on Parallel and Distributed Systems, 1993, 4(2): 131–143
CrossRef
Google scholar
|
[3] |
Gelado I, Stone J E, Cabezas J, Patel S, Navarro N, Hwu W M W. An asymmetric distributed shared memory model for heterogeneous parallel systems. In: Proceedings of the 15th Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems. 2010, 347–358
CrossRef
Google scholar
|
[4] |
Jiang Y, Shen X P, Chen J, Tripathi R. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In: Proceedings of the International Conference on Parallel Architecture and Compilation Techniques. 2008, 220–229
CrossRef
Google scholar
|
[5] |
Tian K, Jiang Y L, Shen X P. A study on optimally co-scheduling jobs of different lengths on chip multiprocessors. In: Proceedings of the 6th ACM Computing Frontiers. 2009, 41–50
CrossRef
Google scholar
|
[6] |
Fedorova A, Seltzer M, Smith M D. Improving performance isolation on chip multiprocessors via an operating system scheduler. In: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques. 2007, 25–38
CrossRef
Google scholar
|
[7] |
El-Moursy A, Garg R, Albonesi D H, Dwarkadas S. Compatible phase co-scheduling on a CMP of multi-threaded processors. In: Proceedings of the 20th International Parallel and Distributed Processing Symposium. 2006
CrossRef
Google scholar
|
[8] |
Menychtas K, Shen K, Scott M L. Disengaged scheduling for fair, protected access to computational accelerators. In: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. 2014, 301–316
CrossRef
Google scholar
|
[9] |
Kato S, Lakshmanan K, Rajkumar R, Ishikawa Y. TimeGraph: GPU scheduling for real-time multi-tasking environments. In: Proceedings of the 2011 USENIX Annual Technical Conference. 2011, 17–30
|
[10] |
Mekkat V, Holey A, Yew P C, Zhai A. Managing shared last-level cache in a heterogeneous multicore processor. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. 2013, 225–234
|
[11] |
Tuck N, Tullsen D M. Initial observations of the simultaneous multithreading Pentium 4 processor. In: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. 2003, 26–35
CrossRef
Google scholar
|
[12] |
Fousek J, Filipoviˇc J, Madzin M. Automatic fusions of CUDA-GPU kernels for parallel map. ACM SIGARCH Computer Architecture News, 2011, 39(4): 98–99
CrossRef
Google scholar
|
[13] |
Wang G B, Lin Y S, Yi W. Kernel fusion: an effective method for better power efficiency on multithreaded GPU. In: Proceedings of the 2010 IEEE/ACM International Conference on Green Computing and Communications & International Conference on Cyber, Physical and Social Computing (CPSCom). 2010, 344–350
CrossRef
Google scholar
|
[14] |
Wu H C, Diamos G, Wang J, Cadambi S, Yalamanchili S, Chakradhar S. Optimizing data warehousing applications for GPUs using kernel fusion /fission. In: Proceedings of Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). 2012, 2433–2442
|
[15] |
Aila T, Laine S. Understanding the efficiency of ray traversal on GPUs. In: Proceedings of the Conference on High Performance Graphics. 2009, 145–149
CrossRef
Google scholar
|
[16] |
Chen L, Villa O, Krishnamoorthy S, Gao G R. Dynamic load balancing on single- and multi-GPU systems. In: Proceedings of 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS). 2010, 1–12
CrossRef
Google scholar
|
[17] |
Gupta K, Stuart J A, Owens J D. A study of persistent threads style GPU programming for GPGPU workloads. In: Proceedings of Innovative Parallel Computing (InPar), 2012. 2012, 1–14
CrossRef
Google scholar
|
[18] |
Xiao S C, Feng W C. Inter-block GPU communication via fast barrier synchronization. In: Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Proceedings. 2010, 1–12
|
[19] |
Li C P, Ding C, Shen K. Quantifying the cost of context switch. In: Proceedings of the 2007 Workshop on Experimental Computer Science. 2007, 2
CrossRef
Google scholar
|
[20] |
Muralidhara S P, Subramanian L, Mutlu O, Kandemir M, Moscibroda T. Reducing memory interference in multicore systems via applicationaware memory channel partitioning. In: Proceedings of the 44th Annual IEEE/ACMInternational Symposium on Microarchitecture. 2011, 374–385
|
[21] |
Liu L, Li Y, Cui Z H, Bao Y G, Chen M Y, Wu C Y. Going vertical in memory management: handling multiplicity by multi-policy. In: Proceedings of the 2014 ACM/IEEE 41st International Symposium on Computer Architecture. 2014, 169–180
CrossRef
Google scholar
|
[22] |
Lin J, Lu Q D, Ding X N, Zhang Z, Zhang X D, Sadayappan P. Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems. In: Proceedings of the 14th IEEE International Symposium on High Performance Computer Architecture. 2008, 367–378
|
[23] |
Liu L, Cui Z H, Xing M J, Bao Y G, Chen M Y, Wu C Y. A software memory partition approach for eliminating bank-level interference in multicore systems. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. 2012, 367–376
CrossRef
Google scholar
|
[24] |
Chang J C, Sohi G S. Cooperative cache partitioning for chip multiprocessors. In: Proceedings of the 25th Annual International Conference on Supercomputing. 2007, 242–252
CrossRef
Google scholar
|
[25] |
Rafique N, Lim W T, Thottethodi M. Architectural support for operating system-driven CMP cache management. In: Proceedings of the 15th International Conference on Parallel Architecture and Compilation Techniques. 2006, 2–12
CrossRef
Google scholar
|
[26] |
Suh G E, Devadas S, Rudolph L. A new memory monitoring scheme for memory-aware scheduling and partitioning. In: Proceedings of the 8th International Symposium on High-Performance Computer Architecture. 2002, 117–128
CrossRef
Google scholar
|
[27] |
Qureshi M K, Patt Y N. Utility-based cache partitioning: a lowoverhead, high-performance, runtime mechanism to partition shared caches. In: Proceedings of the 39th International Symposium on Microarchitecture. 2006, 423–432
CrossRef
Google scholar
|
[28] |
Zhang E Z, Jiang Y L, Shen X P. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2010, 203–212
CrossRef
Google scholar
|
[29] |
Kim Y, Papamichael M, Mutlu O, Harchol-Balter M. Thread cluster memory scheduling: exploiting differences in memory access behavior. In: Proceedings of the 43rd Annual IEEE/ACMInternational Symposium on Microarchitecture (MICRO). 2010, 65–76
CrossRef
Google scholar
|
[30] |
Cook H, Moreto M, Bird S, Dao K, Patterson D A, Asanovic K. A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness. ACM SIGARCH Computer Architecture News, 2013, 41(3): 308–319
CrossRef
Google scholar
|
[31] |
Xie Y J, Loh G H. Scalable shared-cache management by containing thrashing workloads. In: Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers. 2010, 262–276
CrossRef
Google scholar
|
[32] |
Mars J, Tang L J. Whare-Map: Heterogeneity in “homogeneous” warehouse-scale computers. In: Proceedings of the 40th International Symposium on Computer Architecture (ISCA). 2013, 1–12
CrossRef
Google scholar
|
[33] |
Zahedi S M, Lee B C. REF: resource elasticity fairness with sharing incentives for multiprocessors. ACM SIGPLAN Notices, 2014, 49(4): 145–160
CrossRef
Google scholar
|
[34] |
Ausavarungnirun R, Chang K K W, Subramanian L, Loh G H, Mutlu O. Staged memory scheduling: achieving high performance and scalability in heterogeneous systems. ACM SIGARCH Computer Architecture News, 2012, 40(3): 416–427
CrossRef
Google scholar
|
[35] |
Zhu Q, Wu B, Shen X P, Shen L, Wang Z Y. Understanding co-run degradations on integrated heterogeneous processors. In: Proceedings of International Workshop on Languages and Compilers for Parallel Computing (LCPC). 2014, 82–97
|
/
〈 | 〉 |