D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems
Jialun WANG, Wenhao PANG, Chuliang WENG, Aoying ZHOU
In analytical queries, many important operators such as JOIN and GROUP BY are well suited to parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when the data size grows to hundreds of gigabytes, a single GPU card becomes insufficient because of the small capacity of its global memory and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this greatly increases cost. We use unified memory (UM), provided by NVIDIA CUDA (Compute Unified Device Architecture), to make it possible to accelerate large-scale queries on just one GPU. However, we observe that the transfer between host memory and UM, which happens before kernel execution, is often significantly slower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually use only one thread or a static number of threads for data copy, which makes the transfer inefficient and heavily slows overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle dynamically boosts the actual transfer speed through a self-adaptive approach. In our experiments, with data transfer taken into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x on average and up to 2.09x the performance of the baseline system.
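To make the copy-thread issue concrete, the following is a minimal C++ sketch of a chunked, multi-threaded copy in which the thread count is a parameter rather than a fixed constant. It is not D-Cubicle's implementation: the function name `parallel_copy` and its parameters are hypothetical, the destination is an ordinary host buffer standing in for a UM allocation (which in CUDA would typically come from `cudaMallocManaged`), and the abstract does not say how D-Cubicle chooses the thread count, so no selection policy is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper, not taken from the paper: copy `bytes` bytes from
// `src` to `dst` with up to `num_threads` workers, one contiguous chunk each.
// `dst` is any writable host pointer here, so the sketch runs without a GPU;
// in a UM setting it is assumed to point into managed memory instead.
void parallel_copy(void* dst, const void* src, std::size_t bytes,
                   unsigned num_threads) {
    assert(num_threads > 0);
    // Ceiling division so that the chunks cover every byte.
    const std::size_t chunk = (bytes + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = static_cast<std::size_t>(t) * chunk;
        if (begin >= bytes) break;  // fewer chunks than threads; stop early
        const std::size_t len = std::min(chunk, bytes - begin);
        workers.emplace_back([=] {
            std::memcpy(static_cast<char*>(dst) + begin,
                        static_cast<const char*>(src) + begin, len);
        });
    }
    // Wait for every chunk to finish before returning to the caller.
    for (auto& w : workers) w.join();
}
```

A caller could invoke `parallel_copy(dst, src, n, 1)` to mimic the single-thread case and then vary the last argument to see how throughput changes on a given machine. A runtime that adjusts this argument between transfers would be one way to move away from a static thread count.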
data analytics / GPU / unified memory
Jialun Wang is a PhD candidate in the School of Data Science and Engineering, East China Normal University, China. He received his bachelor’s degree in computer science from Sichuan University, China. His research interests include parallel and distributed systems, CPU-GPU heterogeneous systems, and in-memory data processing.
Wenhao Pang is currently working toward a master’s degree in the School of Data Science and Engineering, East China Normal University, China. He received his bachelor’s degree from Donghua University, China. His research interests include parallel and distributed systems and heterogeneous computing.
Chuliang Weng is a full professor of computer science at East China Normal University (ECNU), China. Before joining ECNU, he worked at Huawei Central Research Institute as a principal researcher and technical director. Before joining Huawei, he was an associate professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University, China. He was also a visiting research scientist in the Department of Computer Science at Columbia University, USA. His interests center on building fast and secure systems. His research includes parallel and distributed systems, system virtualization and cloud, storage systems, operating systems, and system security.
Aoying Zhou is a professor at the School of Data Science and Engineering (DaSE), East China Normal University (ECNU), China. He is a CCF Fellow, Vice President of Shanghai Computer Society, and Associate Editor-in-Chief of Chinese Journal of Computer. His research interests include databases, data management, digital transformation, and data-driven applications such as Financial Technology (FinTech), Education Technology (EduTech), and Logistics Technology (LogTech).