Transparent partial page migration between CPU and GPU

Shiqing ZHANG, Zheng QIN, Yaohua YANG, Li SHEN, Zhiying WANG

Front. Comput. Sci., 2020, Vol. 14, Issue 3: 143101. DOI: 10.1007/s11704-018-7386-4
RESEARCH ARTICLE


Abstract

Despite increasing investment in integrated GPUs and next-generation interconnects, discrete GPUs connected via PCIe still dominate the market, and the management of data communication between CPU and GPU continues to evolve. Initially, programmers explicitly controlled data transfers between CPU and GPU. To simplify programming and enable system-wide atomic memory operations, GPU vendors have developed a programming model that provides a single virtual address space for accessing all CPU and GPU memories in the system. The page migration engine in this model automatically migrates pages between CPU and GPU on demand. To meet the needs of high-performance workloads, page sizes tend to grow. Because the interconnect offers low bandwidth and high latency compared to GDDR, migrating a larger page takes longer, which may reduce the overlap of computation and transmission, waste time migrating unrequested data, block subsequent requests, and cause serious performance degradation. In this paper, we propose partial page migration, which migrates only the requested part of a page, reducing the migration unit, shortening migration latency, and avoiding the performance degradation that full page migration suffers as pages grow larger. We show that partial page migration can largely hide the performance overheads of full page migration. Compared with programmer-controlled data transfer, when the page size is 2 MB and the PCIe bandwidth is 16 GB/s, full page migration is 72.72× slower, while our partial page migration achieves a 1.29× speedup. When the PCIe bandwidth is increased to 96 GB/s, full page migration is 18.85× slower, while our partial page migration provides a 1.37× speedup. Additionally, we examine the performance impact of PCIe bandwidth and migration unit size on execution time, enabling designers to make informed decisions.
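For context, the sketch below illustrates the baseline programming model the abstract contrasts against explicit transfers: CUDA unified memory, where a single pointer is valid on both CPU and GPU and the page migration engine moves pages on demand. It is a minimal, self-contained example, not the paper's experimental setup; the kernel, array size, and launch configuration are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // first GPU touch of a page triggers an on-demand migration
}

int main() {
    const int n = 1 << 20;
    float *data;

    // Unified memory: one allocation, one virtual address space for CPU and GPU.
    // No explicit cudaMemcpy; on Pascal-and-later GPUs, pages migrate over PCIe
    // when either side faults on them.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // pages populated on the CPU side

    scale<<<(n + 255) / 256, 256>>>(data, n);    // GPU page faults pull pages in
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);           // CPU access faults pages back
    cudaFree(data);
    return 0;
}
```

The core idea of partial page migration can be sketched as address arithmetic on the faulting access: instead of moving the whole large page across PCIe, move only the aligned sub-unit that contains the fault. The helper below is a hypothetical illustration, not the authors' implementation; the migration unit size is an assumption (the paper studies how this size affects execution time).

```cuda
#define LARGE_PAGE (2UL << 20)   // 2 MB large page, as in the paper's evaluation
#define UNIT       (64UL << 10)  // assumed migration unit for illustration only

// Start of the sub-page region to migrate for a faulting address:
// align the fault address down to a unit boundary, then transfer UNIT bytes
// rather than LARGE_PAGE bytes.
static inline unsigned long migrate_base(unsigned long fault_addr) {
    return fault_addr & ~(UNIT - 1);
}
```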

Keywords

unified memory / data communication / page migration / partial page migration

Cite this article

Shiqing ZHANG, Zheng QIN, Yaohua YANG, Li SHEN, Zhiying WANG. Transparent partial page migration between CPU and GPU. Front. Comput. Sci., 2020, 14(3): 143101 https://doi.org/10.1007/s11704-018-7386-4


RIGHTS & PERMISSIONS

© 2019 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature