Understanding co-run performance on CPU-GPU integrated processors: observations, insights, directions

Qi ZHU; Bo WU; Xipeng SHEN; Kai SHEN; Li SHEN; Zhiying WANG

doi:10.1007/s11704-016-5468-8

Front. Comput. Sci., 2017, Vol. 11, Issue 1: 130-146. DOI: 10.1007/s11704-016-5468-8

RESEARCH ARTICLE



Abstract

Recent years have witnessed a trend in processor development toward integrating the central processing unit (CPU) and the graphics processing unit (GPU) on a single chip. The integration saves some of the host-device data copying that a discrete GPU usually requires, but it also introduces deep resource sharing and possible interference between the CPU and the GPU. This work investigates the performance implications of independently co-running CPU and GPU programs on these platforms. First, we perform a comprehensive measurement that covers a wide variety of factors, including processor architectures, operating systems, benchmarks, timing mechanisms, inputs, and power management schemes. These measurements reveal a number of surprising observations. We analyze these observations and produce a list of novel insights, including the important roles of operating system (OS) context switching and power management in determining program performance, and the subtle effect of CPU-GPU data copying. Finally, we confirm those insights through case studies and point out some promising directions for mitigating anomalous performance degradation on integrated heterogeneous processors.

Keywords

performance analysis / GPGPU / co-run degradation / fused processor / program transformation

Cite this article


Qi ZHU, Bo WU, Xipeng SHEN, Kai SHEN, Li SHEN, Zhiying WANG. Understanding co-run performance on CPU-GPU integrated processors: observations, insights, directions. Front. Comput. Sci., 2017, 11(1): 130-146. DOI: 10.1007/s11704-016-5468-8

