LRP: learned robust data partitioning for efficient processing of large dynamic queries

Pengju LIU , Pan CAI , Kai ZHONG , Cuiping LI , Hong CHEN

Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (9) : 199607

PDF (4718KB)
Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (9) : 199607 DOI: 10.1007/s11704-024-40509-4
Information Systems
RESEARCH ARTICLE

LRP: learned robust data partitioning for efficient processing of large dynamic queries

Author information +
History +
PDF (4718KB)

Abstract

The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution, primarily by minimizing the number of scanned block files. Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, these techniques, despite creating fine-grained partitions from representative queries to enhance system performance, experience from notable performance declines due to unpredictable fluctuations in future queries. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.

Graphical abstract

Keywords

data partitioning / data encoding / query prediction / beam search / data redundancy

Cite this article

Download citation ▾
Pengju LIU, Pan CAI, Kai ZHONG, Cuiping LI, Hong CHEN. LRP: learned robust data partitioning for efficient processing of large dynamic queries. Front. Comput. Sci., 2025, 19(9): 199607 DOI:10.1007/s11704-024-40509-4

登录浏览全文

4963

注册一个新账户 忘记密码

References

[1]

Taylor R W, Sacca D, Wiederhold G . Database partitioning in a cluster of processors. ACM Transactions on Database Systems (TODS), 1985, 10( 1): 29–56

[2]

Copeland G, Alexander W, Boughter E, Keller T. Data placement in bubba. In: Proceedings of 1988 ACM SIGMOD International Conference on Management of Data. 1988, 99−108

[3]

Stöhr T, Märtens H, Rahm E. Multi-dimensional database allocation for parallel data warehouses. In: Proceedings of the 26th International Conference on Very Large Data Bases. 2000, 273−284

[4]

Bentley J L . Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18( 9): 509–517

[5]

Zhan C, Su M, Wei C, Peng X, Lin L, Wang S, Chen Z, Li F, Pan Y, Zheng F, Chai C . AnalyticDB: real-time OLAP database system at alibaba cloud. Proceedings of the VLDB Endowment, 2019, 12( 12): 2059–2070

[6]

Papadomanolakis S, Ailamaki A. AutoPart: automating schema design for large scientific databases using data partitioning. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management. 2004, 383−392

[7]

Sun L, Franklin M J, Krishnan S, Xin R S. Fine-grained partitioning for aggressive data skipping. In: Proceedings of 2014 ACM SIGMOD International Conference on Management of Data. 2014, 1115−1126

[8]

ang Z, Chandramouli B, Wang C, Gehrke J, Li Y, Minhas U F, Larson P Å, Kossmann D, Acharya R. Qd-tree: learning data layouts for big data analytics. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 193−208

[9]

Li Z, Yiu M L, Chan T N. PAW: data partitioning meets workload variance. In: Proceedings of the 38th IEEE International Conference on Data Engineering. 2022, 123−135

[10]

Sun L, Franklin M J, Wang J, Wu E . Skipping-oriented partitioning for columnar layouts. Proceedings of the VLDB Endowment, 2016, 10( 4): 421–432

[11]

Li C, Markl V, Aly A M, Mahmood A R, Hassan M S, Aref W G, Ouzzani M, Elmeleegy H, Qadah T . AQWA: adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, 2015, 8( 13): 2062–2073

[12]

Aly A M, Elmeleegy H, Qi Y, Aref W. Kangaroo: workload-aware processing of range data and range queries in hadoop. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 2016, 397−406

[13]

Lu Y, Shanbhag A, Jindal A, Madden S . AdaptDB: adaptive partitioning for distributed joins. Proceedings of the VLDB Endowment, 2017, 10( 5): 589–600

[14]

Ding J, Minhas U F, Chandramouli B, Wang C, Li Y, Li Y, Kossmann D, Gehrke J, Kraska T. Instance-optimized data layouts for cloud analytics workloads. In: Proceedings of 2021 International Conference on Management of Data. 2021, 418−431

[15]

TPC-H benchmark. See tpc.org/tpch/ website, 1999.

[16]

Rosenblatt F . Principles of neurodynamics: perceptrons and the theory of brain mechanisms. The American Journal of Psychology, 1963, 76( 4): 705–707

[17]

Hochreiter S, Schmidhuber J . Long short-term memory. Neural Computation, 1997, 9( 8): 1735–1780

[18]

Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST). 2010, 1−10

[19]

Shanbhag A, Jindal A, Madden S, Quiane J, Elmore A J. A robust partitioning scheme for ad-hoc query workloads. In: Proceedings of 2017 Symposium on Cloud Computing. 2017, 229−241

[20]

Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X . TIDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13( 12): 3072–3084

[21]

ClickHouse: an open-source columnar database management system. See clickhouse.com/docs/en/observability/managing-data website, 2016

[22]

Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J S, Lee A W, Motivala A, Munir A Q, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P. The snowflake elastic data warehouse. In: Proceedings of 2016 International Conference on Management of Data. 2016, 215−226

[23]

Moerkotte G. Small materialized aggregates: a light weight index structure for data warehousing. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 476−487

[24]

Graefe G. Fast loads and fast queries. In: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery. 2009, 111−124

[25]

Kang D, Jiang R, Blanas S. Jigsaw: a data storage and query processing engine for irregular table partitioning. In: Proceedings of 2021 International Conference on Management of Data. 2021, 898−911

[26]

han A, Yan X, Tao S, Anerousis N. Workload characterization and pre diction in the cloud: a multiple time series approach. In: Proceedings of 2012 IEEE Network Operations and Management Symposium. 2012, 1287−1294

[27]

Pavlo A, Angulo G, Arulraj J, Lin H, Lin J, Ma L, Menon P, Mowry T C, Perron M, Quah I, Santurkar S, Tomasic A, Toor S, Van Aken D, Wang Z, Wu Y, Xian R, Zhang T. Self-driving database management systems. In: Proceedings of the 8th Biennial Conference on Innovative Data Systems Research. 2017, 1

[28]

Ma L, Van Aken D, Hefny A, Mezerhane G, Pavlo A, Gordon G J. Query-based workload forecasting for self-driving database management systems. In: Proceedings of 2018 International Conference on Management of Data. 2018, 631−645

[29]

Hilprecht B, Binnig C, Röhm U. Learning a partitioning advisor for cloud databases. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 143−157

[30]

Zhou X, Li G, Feng J, Liu L, Guo W . Grep: a graph learning based database partitioning system. Proceedings of the ACM on Management of Data, 2023, 1( 1): 94

[31]

Jindal A, Dittrich J. Relax and let the database do the partitioning online. In: Proceedings of the 5th International Workshop on Business Intelligence for the Real-Time Enterprise. 2011, 65−80

[32]

Wang J, Chai C, Liu J, Li G . Face: a normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 2021, 15( 1): 72–84

[33]

Hoerl A E, Kennard R W . Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970, 12( 1): 55–67

[34]

Bertino E, Atzeni P, Tan K L, Chen Y, Tay Y C, Melnik S, Gubarev A, Long J J, Romer G, Shivakumar S, Tolton M, Vassilakis T . Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 2010, 3( 1–2): 330–339

[35]

Ray: an open source framework to build and scale your ML and Python applications. See docs.ray.io/en/latest/ website, 2017

[36]

TPC-DS benchmark. See www.tpc.org/tpcds/ website, 2005

[37]

JOB benchmark. See developer.imdb.com/non-commercial-datasets/ website, 2016

[38]

ClickBench benchmark. See github.com/ClickHouse/ClickBench website, 2019

RIGHTS & PERMISSIONS

Higher Education Press

AI Summary AI Mindmap
PDF (4718KB)

Supplementary files

Highlights

1122

Accesses

0

Citation

Detail

Sections
Recommended

AI思维导图

/