LRP: learned robust data partitioning for efficient processing of large dynamic queries
Pengju LIU, Pan CAI, Kai ZHONG, Cuiping LI, Hong CHEN
Front. Comput. Sci., 2025, Vol. 19, Issue (9): 199607
The interplay between query processing and data partitioning is pivotal for accelerating massive data processing, because a good partitioning minimizes the number of block files scanned during query execution. Existing partitioning techniques predominantly consider query accesses on numeric columns when constructing partitions; they often overlook non-numeric columns, which limits their optimization potential. Moreover, although these techniques build fine-grained partitions from representative queries to improve system performance, they suffer notable performance declines when future queries fluctuate unpredictably. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a data and query encoding method that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the query distribution based on historical queries. To create high-quality, robust partitions from these predictions, LRP adopts a greedy beam search algorithm for partition division and implements a data redundancy mechanism that shares frequently accessed data across partitions. Experimental evaluations show that LRP yields partitions whose performance is more stable under incoming queries and that it significantly surpasses state-of-the-art partitioning methods.
data partitioning / data encoding / query prediction / beam search / data redundancy
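The abstract names beam search as the mechanism for choosing partition boundaries but gives no details. The following is a minimal sketch of the general idea only, not LRP's actual algorithm: it assumes a single numeric column, inclusive range queries, and a cost equal to the number of rows in every partition a query overlaps. The names `scan_cost`, `beam_search_cuts`, `candidates`, and `beam_width` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Query = Tuple[float, float]  # assumed form: inclusive range [lo, hi] on one column


def scan_cost(values: Sequence[float], cuts: Sequence[float],
              queries: Sequence[Query]) -> int:
    """Total rows scanned: a query reads every partition it overlaps."""
    bounds = [float("-inf")] + sorted(cuts) + [float("inf")]
    total = 0
    for left, right in zip(bounds, bounds[1:]):
        # Each partition covers the half-open interval [left, right).
        size = sum(1 for v in values if left <= v < right)
        hits = sum(1 for lo, hi in queries if lo < right and hi >= left)
        total += size * hits
    return total


def beam_search_cuts(values: Sequence[float], queries: Sequence[Query],
                     candidates: Sequence[float], max_cuts: int,
                     beam_width: int = 2) -> Tuple[Tuple[float, ...], int]:
    """Add one cut per round, keeping only the `beam_width` cheapest cut sets."""
    beam: List[Tuple[Tuple[float, ...], int]] = [((), scan_cost(values, [], queries))]
    best = beam[0]
    for _ in range(max_cuts):
        expanded = {}
        for cuts, _cost in beam:
            for c in candidates:
                if c in cuts:
                    continue
                new_cuts = tuple(sorted(cuts + (c,)))
                if new_cuts not in expanded:
                    expanded[new_cuts] = scan_cost(values, new_cuts, queries)
        if not expanded:
            break  # no candidate cut left to add
        # Greedy pruning step: keep the lowest-cost cut sets for the next round.
        beam = sorted(expanded.items(), key=lambda kv: kv[1])[:beam_width]
        if beam[0][1] < best[1]:
            best = beam[0]
    return best


values = list(range(100))
queries = [(0, 9), (50, 59)]
print(beam_search_cuts(values, queries, candidates=[10, 50, 60, 90], max_cuts=3))
# -> ((10, 50, 60), 20)
```

In this toy workload an unpartitioned table costs 200 scanned rows, while the three selected cuts isolate the two queried ranges and reduce the cost to 20. A wider beam explores more alternative cut sets per round at the price of more cost evaluations.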
Higher Education Press