
LRP: learned robust data partitioning for efficient processing of large dynamic queries
Pengju LIU, Pan CAI, Kai ZHONG, Cuiping LI, Hong CHEN
Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (9) : 199607.
LRP: learned robust data partitioning for efficient processing of large dynamic queries
The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution, primarily by minimizing the number of scanned block files. Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, these techniques, despite creating fine-grained partitions from representative queries to enhance system performance, experience from notable performance declines due to unpredictable fluctuations in future queries. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.
data partitioning / data encoding / query prediction / beam search / data redundancy
Pengju Liu is currently pursuing his PhD degree at the School of Information, Renmin University of China, China. He received his B.S. degree in information management and information system from Dalian Maritime University, China, in 2020. His research interests include adaptable database partitioning, and load forecasting
Pan Cai is a PhD candidate at School of Information, Renmin University of China, China. Her research interests include query optimization, learned index, and machine learning. She is particularly interested in the structural design of learned indexes on multidimensional data
Kai Zhong is a PhD student at School of Information, Renmin University of China, China, advised by Professor Cuiping Li. He received his BS degree in Information and Computing Science at School of Information, Huazhong Agricultural University, China in June 2022. His research lie in the field of AI4DB and Machine Learning
Cuiping Li is currently a professor at the Renmin University of China, China. She received her PhD degree from Chinese Academy of Sciences, China in 2003. Prior to that, she earned her BS and MS degrees from Xi’an Jiaotong University, China in 1994 and 1997, respectively. She is a distinguished member of CCF. Her main research interests are machine learning for database and distributed query optimization
Hong Chen is currently a professor at the Renmin University of China, China. She received her PhD degree from Chinese Academy of Sciences, China in 2000. Before that, she received her BS and MS degrees from Renmin University of China, China in 1986 and 1989, respectively. She received Second Prize of the National Award for Science and Technology Progress in 2018. She is a committee member of CCF and CIC. Her research interests include database technology and high-performance computing
[1] |
Taylor R W, Sacca D, Wiederhold G . Database partitioning in a cluster of processors. ACM Transactions on Database Systems (TODS), 1985, 10( 1): 29–56
|
[2] |
Copeland G, Alexander W, Boughter E, Keller T. Data placement in bubba. In: Proceedings of 1988 ACM SIGMOD International Conference on Management of Data. 1988, 99−108
|
[3] |
Stöhr T, Märtens H, Rahm E. Multi-dimensional database allocation for parallel data warehouses. In: Proceedings of the 26th International Conference on Very Large Data Bases. 2000, 273−284
|
[4] |
Bentley J L . Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18( 9): 509–517
|
[5] |
Zhan C, Su M, Wei C, Peng X, Lin L, Wang S, Chen Z, Li F, Pan Y, Zheng F, Chai C . AnalyticDB: real-time OLAP database system at alibaba cloud. Proceedings of the VLDB Endowment, 2019, 12( 12): 2059–2070
|
[6] |
Papadomanolakis S, Ailamaki A. AutoPart: automating schema design for large scientific databases using data partitioning. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management. 2004, 383−392
|
[7] |
Sun L, Franklin M J, Krishnan S, Xin R S. Fine-grained partitioning for aggressive data skipping. In: Proceedings of 2014 ACM SIGMOD International Conference on Management of Data. 2014, 1115−1126
|
[8] |
ang Z, Chandramouli B, Wang C, Gehrke J, Li Y, Minhas U F, Larson P Å, Kossmann D, Acharya R. Qd-tree: learning data layouts for big data analytics. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 193−208
|
[9] |
Li Z, Yiu M L, Chan T N. PAW: data partitioning meets workload variance. In: Proceedings of the 38th IEEE International Conference on Data Engineering. 2022, 123−135
|
[10] |
Sun L, Franklin M J, Wang J, Wu E . Skipping-oriented partitioning for columnar layouts. Proceedings of the VLDB Endowment, 2016, 10( 4): 421–432
|
[11] |
Li C, Markl V, Aly A M, Mahmood A R, Hassan M S, Aref W G, Ouzzani M, Elmeleegy H, Qadah T . AQWA: adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, 2015, 8( 13): 2062–2073
|
[12] |
Aly A M, Elmeleegy H, Qi Y, Aref W. Kangaroo: workload-aware processing of range data and range queries in hadoop. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 2016, 397−406
|
[13] |
Lu Y, Shanbhag A, Jindal A, Madden S . AdaptDB: adaptive partitioning for distributed joins. Proceedings of the VLDB Endowment, 2017, 10( 5): 589–600
|
[14] |
Ding J, Minhas U F, Chandramouli B, Wang C, Li Y, Li Y, Kossmann D, Gehrke J, Kraska T. Instance-optimized data layouts for cloud analytics workloads. In: Proceedings of 2021 International Conference on Management of Data. 2021, 418−431
|
[15] |
TPC-H benchmark. See tpc.org/tpch/ website, 1999.
|
[16] |
Rosenblatt F . Principles of neurodynamics: perceptrons and the theory of brain mechanisms. The American Journal of Psychology, 1963, 76( 4): 705–707
|
[17] |
Hochreiter S, Schmidhuber J . Long short-term memory. Neural Computation, 1997, 9( 8): 1735–1780
|
[18] |
Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST). 2010, 1−10
|
[19] |
Shanbhag A, Jindal A, Madden S, Quiane J, Elmore A J. A robust partitioning scheme for ad-hoc query workloads. In: Proceedings of 2017 Symposium on Cloud Computing. 2017, 229−241
|
[20] |
Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X . TIDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13( 12): 3072–3084
|
[21] |
ClickHouse: an open-source columnar database management system. See clickhouse.com/docs/en/observability/managing-data website, 2016
|
[22] |
Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J S, Lee A W, Motivala A, Munir A Q, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P. The snowflake elastic data warehouse. In: Proceedings of 2016 International Conference on Management of Data. 2016, 215−226
|
[23] |
Moerkotte G. Small materialized aggregates: a light weight index structure for data warehousing. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 476−487
|
[24] |
Graefe G. Fast loads and fast queries. In: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery. 2009, 111−124
|
[25] |
Kang D, Jiang R, Blanas S. Jigsaw: a data storage and query processing engine for irregular table partitioning. In: Proceedings of 2021 International Conference on Management of Data. 2021, 898−911
|
[26] |
han A, Yan X, Tao S, Anerousis N. Workload characterization and pre diction in the cloud: a multiple time series approach. In: Proceedings of 2012 IEEE Network Operations and Management Symposium. 2012, 1287−1294
|
[27] |
Pavlo A, Angulo G, Arulraj J, Lin H, Lin J, Ma L, Menon P, Mowry T C, Perron M, Quah I, Santurkar S, Tomasic A, Toor S, Van Aken D, Wang Z, Wu Y, Xian R, Zhang T. Self-driving database management systems. In: Proceedings of the 8th Biennial Conference on Innovative Data Systems Research. 2017, 1
|
[28] |
Ma L, Van Aken D, Hefny A, Mezerhane G, Pavlo A, Gordon G J. Query-based workload forecasting for self-driving database management systems. In: Proceedings of 2018 International Conference on Management of Data. 2018, 631−645
|
[29] |
Hilprecht B, Binnig C, Röhm U. Learning a partitioning advisor for cloud databases. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 143−157
|
[30] |
Zhou X, Li G, Feng J, Liu L, Guo W . Grep: a graph learning based database partitioning system. Proceedings of the ACM on Management of Data, 2023, 1( 1): 94
|
[31] |
Jindal A, Dittrich J. Relax and let the database do the partitioning online. In: Proceedings of the 5th International Workshop on Business Intelligence for the Real-Time Enterprise. 2011, 65−80
|
[32] |
Wang J, Chai C, Liu J, Li G . Face: a normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 2021, 15( 1): 72–84
|
[33] |
Hoerl A E, Kennard R W . Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970, 12( 1): 55–67
|
[34] |
Bertino E, Atzeni P, Tan K L, Chen Y, Tay Y C, Melnik S, Gubarev A, Long J J, Romer G, Shivakumar S, Tolton M, Vassilakis T . Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 2010, 3( 1–2): 330–339
|
[35] |
Ray: an open source framework to build and scale your ML and Python applications. See docs.ray.io/en/latest/ website, 2017
|
[36] |
TPC-DS benchmark. See www.tpc.org/tpcds/ website, 2005
|
[37] |
JOB benchmark. See developer.imdb.com/non-commercial-datasets/ website, 2016
|
[38] |
ClickBench benchmark. See github.com/ClickHouse/ClickBench website, 2019
|
/
〈 |
|
〉 |