D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems
Jialun WANG, Wenhao PANG, Chuliang WENG, Aoying ZHOU
Front. Comput. Sci., 2023, Vol. 17, Issue 4: 174610
In analytical queries, important operators such as JOIN and GROUP BY are well suited to parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when data size grows to hundreds of gigabytes, a single GPU card becomes insufficient due to the limited capacity of global memory and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this greatly increases cost. We instead use unified memory (UM), provided by NVIDIA CUDA (Compute Unified Device Architecture), to accelerate large-scale queries on a single GPU. However, we observe that the transfer between host memory and UM, which happens before kernel execution, is often significantly slower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one or a static number of threads for data copy, leading to inefficient transfer that heavily slows down overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, with data transfer included, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x the performance of the baseline system on average and up to 2.09x.
data analytics / GPU / unified memory
Higher Education Press