LRP: learned robust data partitioning for efficient processing of large dynamic queries

Pengju LIU , Pan CAI , Kai ZHONG , Cuiping LI , Hong CHEN

Front. Comput. Sci., 2025, Vol. 19, Issue 9: 199607
DOI: 10.1007/s11704-024-40509-4

Information Systems | RESEARCH ARTICLE
Abstract

The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution, primarily by minimizing the number of scanned block files. Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, although these techniques create fine-grained partitions from representative queries to enhance system performance, they suffer notable performance declines due to unpredictable fluctuations in future queries. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.


Keywords

data partitioning / data encoding / query prediction / beam search / data redundancy

Cite this article

Pengju LIU, Pan CAI, Kai ZHONG, Cuiping LI, Hong CHEN. LRP: learned robust data partitioning for efficient processing of large dynamic queries. Front. Comput. Sci., 2025, 19(9): 199607 DOI:10.1007/s11704-024-40509-4


1 Introduction

Data partitioning is a pervasive concept in our daily lives, with a simple analogy being the organization of goods within supermarkets into distinct sections for easier navigation. This principle is extensively adopted in database management systems (DBMS) as a crucial step in database physical design [1–3] to enhance the efficiency of data retrieval and management. Horizontal partitioning (HP) [4–9] stands out as a critical branch of data partitioning that aims for minimal row granularity during data division to optimize data management. It involves selecting commonly used column distributions or mining the query-data access relationship as partitioning features to finely allocate table data into smaller partition files, thereby accelerating queries. For example, in Fig.1①, if a two-dimensional tablespace D (400 MB in size) is partitioned based on the areas targeted by queries (black dots), we only need to access two partitions R1 and R2 (165 MB in total) to fetch all the required data. This method circumvents the need to scan the entire tablespace, which often occurs when using simplistic data-driven partitioning strategies, such as uniform or random division of the tablespace.

Recent HP studies [7–14] regard query acceleration as the primary optimization objective, extracting valuable predicate conditions to greedily split a given D for maximal data skipping. However, there are two major limitations:

(1) Current research only considers numeric column-related predicates, making the split partitions not fully adaptable to real querying areas. For example, by comparing Fig.1① and 1②, the measured querying boundaries (black dots) are larger than the real querying ones (green dots) due to the neglect of non-numeric predicates. This discrepancy results in an inaccurate data layout (solid boxes, 165 MB) that scans an additional 25 MB of data compared to the optimal layout (dotted boxes, 145 MB). Thus, considering predicates related to non-numeric data is crucial for partitioning, especially for tables like TPC-H [15], which have few numeric columns. In such tables, 73.6% of the columns are non-numeric, and 86% of queries utilize these columns in filter conditions.

(2) Another limitation is that most studies demonstrate superior performance for static queries but are difficult to adapt to query shifts. Only a few studies [11,13,14] adapt to dynamic query workloads through periodic data re-partitioning, which can be a costly operation for a DBMS. For example, in Fig.1③, we have created two partitions R1 and R2 for old queries (black dots) in D. However, as new queries (red dots) arrive, the amount of scanned data can increase from 165 to 400 MB. This occurs because the partially queried areas outside R1 and R2 are randomly allocated to different partitions, potentially requiring a search of the entire table. By assessing the similarity of querying boundaries between new and old queries (introduced in [9]), we categorize new queries into two types: similarity queries (red dotted boxes) and exploratory queries (other red dots). If the robust partition R1 is created in advance for incoming similarity and exploratory queries (see Fig.1④), we can save 215 MB of data scanning without incurring the expensive re-partitioning required to update R1 for query shifts.

Overall, there are three primary challenges. The first challenge arises from the absence of non-numeric-type predicates, which alters the candidate split condition set for partitioning and consequently impacts the order in which the tablespace is split. Hence, how to fully incorporate the complete predicate conditions in the partition construction process (C1) is crucial. The second challenge involves the dynamic nature of collected workloads; it is essential to predict query distribution changes accurately to create robust partitions (C2) that mitigate the need for frequent re-partitioning. The third challenge, a common objective in previous studies, is refining predicate-based tablespace partitioning (C3) to improve query performance.

Our proposed solution. We propose a Learned Robust Partitioning system (LRP) that utilizes historical query logs to generate robust data layouts. Firstly, LRP identifies queries with similar characteristics submitted at distinct times, then designs complete data encoding schemes and a logical tree structure to embed these raw queries into structured query vectors. This vector representation incorporates various types of predicate conditions and their logical relationships (addressing C1). Secondly, LRP employs two neural networks, multi-layer perceptron (MLP) [16] and long short-term memory (LSTM) [17], selected based on the temporal features of the queries, to predict incoming similarity queries. LRP also introduces a novel loss function that replaces the traditional Mean Squared Error (MSE) with a partition semantic correlation error to improve prediction accuracy (addressing C2). Thirdly, LRP decodes the predicted query vectors into candidate predicates and utilizes a beam search algorithm to determine the locally optimal predicate condition for each tablespace split. Furthermore, LRP introduces an effective data redundancy mechanism that replicates frequently accessed data to minimize query access contention at each partition (addressing C3). To enhance the robustness of fixed layouts against exploratory queries, LRP employs a data-driven KD-Tree [4] to partition the remaining tablespace after predicate-based splitting.

It is important to note that, similar to QdTree [8] and PAW [9], our approach does not directly optimize multi-table queries, such as their join phases. Instead, it indirectly optimizes them by reducing the number of blocks required for joins.

Contributions. In summary, we make the following contributions:

● We propose multiple data and query encoding schemes to capture the often-overlooked access features of non-numeric columns. Then, we present a logical tree for the transformation of raw queries into vectors.

● We design two optional predictive networks, along with a loss function integrating partition semantics, to accurately predict changes in similarity queries, ensuring robust layout creation over these predictions.

● We employ beam search and KD-Tree strategies to find the optimal predicate allocation order for splitting a given tablespace. Moreover, we design a data redundancy policy that aims to minimize query access contention at a negligible storage cost.

● Through rigorous testing on benchmark datasets, we demonstrate that our method outperforms existing techniques in both performance and robustness.

2 Preliminaries

In this section, we introduce the relevant concepts of queries and partitioning, followed by defining the two main problems to be addressed. Next, we present the related work. Tab.1 summarizes the necessary notations used in this paper.

2.1 Similarity queries

The DBMS optimizer generates a query plan to execute a user query efficiently. Within this plan, filter operations are typically executed first, determining the satisfied data rows through predicate pushdown. Given the complete predicates extracted from a query $q$ and a table dataset $D$ with $z$ columns $c_1,\ldots,c_z$, we can characterize its query feature as the column domains of the satisfied rows, represented in vector form $H_q \in \mathbb{R}^{2z}$, i.e.,

$$H_q = [l_i(q), u_i(q)]_{i=1}^{z},$$

where $l_i(q), u_i(q)$ denote the lower and upper bounds of the $i$th column domain for the data queried by $q$. Similarly, we represent the domains of the entire table as $H_D = [l_i, u_i]_{i=1}^{z} \in \mathbb{R}^{2z}$, where $l_i, u_i$ denote the lower and upper bounds of the $i$th column's data.

In real-world scenarios, many new and old queries often share the same filter columns and similar predicate conditions [9]. This leads to certain rows being repeatedly accessed by user queries, as illustrated by the red dotted boxes in Fig.1③. We define two optional metrics to measure the similarity feature between queries q1,q2:

1) The first optional similarity metric is the overlapping data area $\phi(H_{q_1}, H_{q_2})$ accessed by the two queries, measured as the size (in bytes) of the set of co-accessed rows:

$$\phi(H_{q_1}, H_{q_2}) = \sum_{D[i] \in D} |D[i]| \cdot \mathbb{1}\left[\forall j:\ (l_j(q_1) \le D[i,j] \le u_j(q_1)) \wedge (l_j(q_2) \le D[i,j] \le u_j(q_2))\right],$$

where the $i$th row in $D$ is denoted as $D[i]$, with byte length $|D[i]|$, and $D[i,j]$ is the $j$th column value of $D[i]$.

2) The second similarity metric evaluates the access distance, denoted $\Delta(H_{q_1}, H_{q_2})$, between the two queries on each column dimension. This is calculated by finding the maximum difference between column boundaries:

$$\Delta(H_{q_1}, H_{q_2}) = \max_{i=1}^{z} \frac{|u_i(q_1) - u_i(q_2)| + |l_i(q_1) - l_i(q_2)|}{2}.$$

Then we can define how to identify queries that satisfy different levels of similarity.

Definition 1 (δ-similar queries). Let δ be a distance threshold. Consider a historical load ($Q_H$) and a future load ($Q_F$) with distinct submission timestamps. If we can find two subsets $Q_H' \subseteq Q_H$ and $Q_F' \subseteq Q_F$ such that for any $q_h \in Q_H'$ there always exists a mapping $q_h \to q_f$ ($q_f \in Q_F'$) satisfying

$$\max\!\left(\frac{\Delta(H_{q_h}, H_{q_f})}{\Delta(H_{q_h})}, \frac{\Delta(H_{q_h}, H_{q_f})}{\Delta(H_{q_f})}\right) \le \delta \quad \text{or} \quad \min\!\left(\frac{\phi(H_{q_h}, H_{q_f})}{\phi(H_{q_h})}, \frac{\phi(H_{q_h}, H_{q_f})}{\phi(H_{q_f})}\right) \ge 1 - \delta,$$

then we deem them δ-similar. The remaining exploratory queries are obtained as $Q_E = Q_F \setminus Q_F'$. Here, $\Delta(H_q)$ and $\phi(H_q)$ are defined as follows:

$$\Delta(H_q) = \max_{1 \le i \le z} |u_i(q) - l_i(q)|,$$

$$\phi(H_q) = \sum_{D[i] \in D} |D[i]| \cdot \mathbb{1}\left[\forall\, 1 \le j \le z:\ l_j(q) \le D[i,j] \le u_j(q)\right].$$
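The two similarity metrics above can be sketched in a few lines. This is an illustrative toy, not the paper's code: it assumes a table is a list of rows, a query box is a list of per-column (lower, upper) bounds, and every column value occupies a fixed 8 bytes.

```python
def row_size(row):
    """Byte size of a row (here: a fixed 8 bytes per column value)."""
    return 8 * len(row)

def phi(h1, h2, table):
    """Co-accessed data area: total byte size of rows inside both query boxes."""
    total = 0
    for row in table:
        if all(l1 <= v <= u1 and l2 <= v <= u2
               for v, (l1, u1), (l2, u2) in zip(row, h1, h2)):
            total += row_size(row)
    return total

def delta(h1, h2):
    """Access distance: max per-column boundary difference between two boxes."""
    return max((abs(u1 - u2) + abs(l1 - l2)) / 2
               for (l1, u1), (l2, u2) in zip(h1, h2))

def width(h):
    """Delta(H_q): the widest column domain of a single query box."""
    return max(abs(u - l) for l, u in h)

def is_delta_similar(hq_h, hq_f, d):
    """Distance-based test from Definition 1 (the Delta variant)."""
    dist = delta(hq_h, hq_f)
    return max(dist / width(hq_h), dist / width(hq_f)) <= d
```

For two 2-column query boxes `[(0, 60), (1, 3)]` and `[(40, 100), (1, 2)]`, `delta` is 40 while each box's own width is 60, so the pair is δ-similar for any δ ≥ 40/60.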

2.2 Horizontal partitioning layout

A query-aware data layout P consists of disjoint partitions (R1,...,Rm) created from query samples Q(q1,...,qn). Each partition Ri is materialized as block files with sizes limited to $[b_{min}, 2b_{min}]$ ($b_{min} = 64$ MB in HDFS [18]). The I/O cost of processing query q over P is evaluated as the total size of accessed partitions, denoted as C(P,q). To build a data layout, recent works [8,9,12–14,19] propose a partition tree structure, similar to an index tree, which has been proven effective in guiding data allocation. We give its definition as follows:

Definition 2 (Partition index tree). The partition index tree ($T$) acts as a router, allocating data to specific partitions. Creating $T$ starts with a root node covering the entire table space, and we select a feasible predicate to split it into multiple child nodes. Each leaf node is then processed sequentially until no more child nodes can be generated. In $T$, all nodes maintain metadata to guide query skipping and support data routing. Each leaf node ($V_i$) is ultimately materialized as a single partition ($R_i$), with a node size $|V_i|$ determined by the number of rows it contains. This node size ranges over $[V_{min}, V_{max}]$, i.e.,

$$|V_i| \in \left[\frac{b_{min}}{\max(|D[1]|, \ldots, |D[n]|)},\ \frac{2 \times b_{min}}{\max(|D[1]|, \ldots, |D[n]|)}\right].$$
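The leaf-size bound above is just the block-file byte limit divided by the worst-case row size. A minimal sketch, with illustrative names not taken from the paper's code:

```python
def leaf_row_bounds(b_min_bytes, row_sizes):
    """Return (V_min, V_max) row counts for a leaf node.

    A leaf materializes to a block file of [b_min, 2*b_min] bytes, so its
    row count is bounded by dividing those limits by the largest row size
    max(|D[1]|, ..., |D[n]|).
    """
    worst = max(row_sizes)
    return b_min_bytes // worst, (2 * b_min_bytes) // worst

# With HDFS's 64 MB minimum block size and rows of at most 128 bytes:
v_min, v_max = leaf_row_bounds(64 * 1024 * 1024, [100, 128, 96])
```

With these numbers a leaf must hold between 524,288 and 1,048,576 rows.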

Next, we formally outline the two main problems to be addressed in this paper.

Problem 1 (Robust partitioning). Given a training set $Q_H \cup Q_F$ and a test set $\overline{Q_H} \cup \overline{Q_F}$, where each mapping satisfies δ-similarity, we consider two cases: (1) $Q_H$ is evenly divided into an ordered sequence based on query submission time; (2) submission time is not recorded, rendering all queries unordered. Our goal is to find an optimal model $M$ that fits the $Q_H \to Q_F$ mapping, enabling it to generate a predicted load $\overline{Q_P}$ based on $\overline{Q_H}$. This $\overline{Q_P}$ should maximize the cumulative co-accessed data area with $\overline{Q_F}$ while ensuring that the robust data layout built upon $\overline{Q_P}$ has the smallest query cost difference versus the layout generated from $\overline{Q_F}$, i.e.,

$$M = \arg\max_{M} \phi\big(M(\overline{Q_H} \mid Q_{H \to F}), \overline{Q_F}\big), \quad \text{s.t.}\ C\big(P[M(\overline{Q_H})], \overline{Q_F}\big) - C\big(P[\overline{Q_F}], \overline{Q_F}\big)\ \text{is minimized}.$$

Problem 2 (Optimal layout). Given $Q_P$, $Q_F$, $Q_E$, and $n$ relation tables, this problem asks for constructing an individual partition tree $T_i$ for the $i$th table, followed by routing its table data $D_i$ on $T_i$ to create a robust data layout $P_i$, such that the total I/O cost of executing $Q_E + Q_F$ (i.e., $Q_{E+F}$) over all layouts $P = \{P_1, \ldots, P_n\}$ is minimized. This process is formulated as follows:

$$P_i \leftarrow \mathrm{Route}\big(T_i(Q_P), D_i\big),\ 1 \le i \le n, \qquad P = \arg\min_{P} \sum_{i=1}^{n} C(P_i, Q_{E+F}).$$

2.3 Related work

Data-driven partitioning. In most popular database products, such as TiDB [20], ClickHouse [21], and Snowflake [22], partitioning rules based on data distribution are still recommended as the preferred option. This method is suitable for data with prior statistics but requires careful selection of partition boundaries, including hash and range partitioning [4], along with partition maintenance structures such as SMA [23] and Zone Maps [24]. These methods exhibit low sensitivity and high adaptability in rapidly changing load scenarios, but with relatively lower performance. Conversely, our LRP ensures both superior performance and minimized partition updates to accommodate dynamic workloads.

Query-driven partitioning. There are two approaches for creating query-driven partitioning rules. Classifier-based methods [7,10] extract representative predicate conditions from historical loads, cluster tuples based on the similarity of their satisfying predicates, and subsequently compute classification features for each cluster. Tree-based methods [8,9,12–14,19] incrementally build a partition tree by selecting numeric-type predicates with the maximum query-skipping benefit as the split condition at each tree expansion stage. Jigsaw [25] provides optimal data skipping with Tetris-shaped partitions managed by logical segments, yet it requires multiple hash tables for frequent tuple reconstruction. Unlike prior studies, LRP is the first tree partitioner to identify additional non-numeric predicates and implement a refined beam search policy for selecting better predicate split combinations.

Adaptive partitioning. To cope with dynamic loads, [11,13,14,19] manage re-partitioning based on the filling of a predefined-length query window, while [8] relies on periodically monitoring distribution differences between new and old data. Learning-based methods [26–30] use RNN/RL-style models to monitor and predict query distribution changes, aiding in calculating potential re-partitioning benefits. Another solution is to create a robust layout that reduces the re-partitioning frequency. PAW [9] introduces a similar-load scenario and searches for a fixed expansion value to fit query changes, which cannot always guarantee high layout quality. LRP addresses this by training prediction networks that incorporate partition evaluation factors into the loss function.

3 LRP overview

3.1 LRP framework

Overview. Fig.2 shows the architecture of the LRP framework with its three-stage task.

Step 1 — Query feature embedding. Initially, LRP reads raw queries from real system logs and encodes non-numeric column data using predefined coding rules based on the table schema. The non-numeric columns in the raw table domain are then updated based on the encoding scheme of the corresponding columns. Next, LRP extracts complete logical predicate trees (defined in Subsection 4.2) from queries, encodes them, and serializes them into structured query vectors via level-by-level logical domain computation.

Step 2 — Similarity load prediction. Depending on whether the submission time is known, these query vectors are classified as ordered or unordered. Accordingly, LRP adopts the corresponding network, LSTM for ordered and MLP for unordered, to predict changes in query vectors.

Steps 3–5 — Data layout construction. To create the data layout instance, LRP decodes predicted vectors into logical predicates, followed by a step-by-step construction of the partition tree (T). Starting from the root node, LRP generates the candidate predicate split set for the current node and applies a beam search-based split policy to select the optimal split condition for creating child nodes. The T is incrementally expanded until all leaf nodes meet the partition size requirements. LRP then further reallocates leaf node data by replicating frequently accessed data across multiple partitions for greater data skipping. All table data is routed by T to leaf nodes, which are subsequently materialized into specified partition files.

Query processing. When processing a new query, the query optimizer uses the integrated T to identify required partitions, developing an appropriate query plan and forwarding it to the query executor to obtain the result set.

Example 1 Given a query $q_h$ and the domain for table E(c1, c2) in Fig.2, where c1 (a numeric column) and c2 (a categorical column) respectively use min-max values and distinct values as their raw domain representations. First, LRP uses a dictionary table to encode the country column c2 because it is an enumeration type, i.e., {1:'USA', 2:'China', 3:'Germany', 4:'Japan', 5:'India'}, to obtain a new numeric table domain, i.e., [[1,100],[1,5]]. Subsequently, all predicates are extracted from $q_h$ and those related to the encoded columns are processed; for instance, 'c2 IN (USA, China, Germany)' is converted to '1 ≤ c2 ≤ 3'. Through logical computation among predicates, LRP obtains the vector representation of $q_h$ as $H_{q_h} = [[50,100],[1,3]]$. Since individual queries lack temporal order, LRP selects the MLP network to generate the vector prediction $H_{q_p} = [[45,100],[1,3]]$, which is then decoded into four predicates ($ps_1, ps_2, ps_3, ps_4$). Employing the split order $\{ps_4 \to ps_1 \to ps_3 \to ps_2\}$ recommended by the tree split policy, LRP constructs a four-layer partition tree ($T$) comprising five leaf nodes. These leaf nodes are further refined by replicating high-frequency hot data among partitions. Finally, by routing the entire table data through $T$, five partition files ($R_1, R_2, R_3, R_4, R_5$) are created as the final data layout.

3.2 Role of LRP

LRP primarily targets boosting partitioning performance while emphasizing robustness under dynamic loads. It is applicable in the following scenarios:

Static optimal layout. By utilizing more comprehensive predicate features and refining the table space partitioning strategy, LRP functions as an efficient static partitioning algorithm when the load prediction module is disabled.

Complementary to re-partitioning. Although LRP creates robust layouts, it does not eliminate the need for re-partitioning operations to maintain system stability during significant performance declines (e.g., an increase in exploratory queries). Instead, LRP aims to reduce the frequency of re-partitioning and can also work collaboratively with existing re-partitioning methodologies [13,31] rather than conflicting with them.

4 Data encoding and query vector extraction

In this section, we fully leverage non-numeric column data’s access properties to generate complete logical predicates as partitioning features, which can then be embedded into unified query vectors for subsequent load prediction.

4.1 Data encoding phase

There are various types of non-numeric columns in a table schema, including date strings, enumeration data, and other fixed-length and variable-length texts. We design multiple encoding functions to convert them into numeric data, facilitating the computation of their column domains based on the encoded data distribution.

Encoding strategy. As shown in Fig.3①, for a non-numeric column c:

1) If c is of date type, a date formatting function is applied, i.e., $D[\cdot, c] = \mathrm{TimeStamp}(D[\cdot, c])$.

2) If c is of enumeration type or if $|\mathrm{distinct}(D[\cdot, c])| / |D[\cdot, c]|$ is less than a constant $\vartheta_C$, we directly utilize a dictionary encoding strategy by creating a finite-sized dictionary table $Dict_c$ for all distinct values $(v_0, \ldots, v_n)$ in column c, i.e., $Dict_c[v_0] = 0, \ldots, Dict_c[v_n] = n$.

3) If c has a maximum text length less than the predefined constant ϑL (classified as Complex Type-1), a trie-based index tree [32] is constructed, which is suitable for long and non-categorical strings. The encoding process starts at the root node, identifying the encoding value of the ith letter in a given string from nodes at depth i, ultimately generating a sequence as its encoding key.

4) Otherwise (classified as Complex Type-2), no encoding is performed.

We determine suitable values for the above constants (ϑC=0.1, ϑL=15) through empirical judgment and experimental validation.

Encoding Key Allocation Order. For each encoding column, we avoid allocating continuous dictionary keys to column values in alphabetical order. Instead, our goal is to ensure that the most frequently co-accessed texts are allocated continuous keys, facilitating subsequent domain simplification of non-numeric columns. To achieve this, we first record the co-occurrence frequency of column values referenced by queries, sort them in descending order, and then determine the allocation priority for each column value accordingly. For example, in Fig.2, if the countries ‘USA’ and ‘Germany’ are frequently queried together, we modify the encoding scheme of (‘USA’, ‘China’, ‘Germany’) from (1, 2, 3) to (1, 3, 2).
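The key-allocation idea above can be sketched as follows. This is a simplified, hypothetical greedy variant (not the paper's implementation): it only places the single most co-queried pair adjacently and leaves the remaining values in input order, whereas the text describes ranking all values by descending co-occurrence frequency.

```python
from collections import Counter
from itertools import combinations

def allocate_keys(queried_value_sets, all_values):
    """Assign dictionary keys so the most co-queried pair gets adjacent keys."""
    co_freq = Counter()
    for vals in queried_value_sets:          # values referenced by one query
        for pair in combinations(sorted(vals), 2):
            co_freq[pair] += 1
    order = []
    if co_freq:                              # place the hottest pair adjacently
        a, b = co_freq.most_common(1)[0][0]
        order = [a, b]
    for v in all_values:                     # remaining values keep input order
        if v not in order:
            order.append(v)
    return {v: key for key, v in enumerate(order, start=1)}
```

If 'USA' and 'Germany' are frequently queried together, they receive consecutive keys, so a predicate over both encodes to a tight interval rather than a gap-filled one.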

Domain computation. We then compute the column domains for the encoded table dataset, which supports non-numeric column-based node splitting in partition trees (see Section 6).

1) For numeric and date columns ($c_i$), we can directly use the min/max functions to compute their domain boundaries, i.e., $[l_{c_i}, u_{c_i}] \leftarrow [\min(D[\cdot, c_i]), \max(D[\cdot, c_i])]$.

2) For encoded text columns ($c_j$), we adopt the distinct encoding keys as their list-type domain representation, i.e., $[l_{c_j}, u_{c_j}] \leftarrow \mathrm{sorted}(\mathrm{distinct}(D[\cdot, c_j]))$.

3) For unencoded text columns ($c_k$), whose data distribution cannot be effectively quantified, we use 'None' as the domain identifier, i.e., $[l_{c_k}, u_{c_k}] \leftarrow [\mathrm{None}, \mathrm{None}]$.

Example 2 Fig.3② illustrates how a simple raw table with 4 rows and 5 columns is encoded. For each non-numeric column c2, ..., c5, we employ a distinct encoding scheme. This includes converting the date column (c2) using a general timestamp function, constructing dictionary tables for the enumeration columns (c3, c4), and building a trie-based tree for the irregular text column (c5). After encoding, the table data consists only of pure numerical values. We then utilize the three domain computation rules to represent the domains for each column, forming the encoded numeric table domain.
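The three domain-computation rules can be condensed into one dispatch function. A minimal sketch, assuming each column is tagged with its post-encoding kind; the tags and function name are illustrative, not the paper's API:

```python
def column_domain(kind, values):
    """Domain for one encoded column: min-max, sorted key list, or None marker."""
    if kind in ("numeric", "date"):      # rule 1: min-max boundaries
        return [min(values), max(values)]
    if kind == "encoded_text":           # rule 2: sorted distinct encoding keys
        return sorted(set(values))
    return [None, None]                  # rule 3: unencoded text, unquantifiable

# e.g. column_domain("numeric", [7, 1, 5]) yields the min-max domain [1, 7]
```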

4.2 Query encoding phase

Predicate formatting. Given any predicate, we represent it as a triplet (μ, op, ν), consisting of the column (μ), operator (op), and condition value (ν). To our knowledge, previous studies [6, 8–10, 14] have predominantly focused on predicate conditions involving comparison operators (e.g., '≤', '≥', '=', and '≠') associated with numeric columns. To support more predicate types, we extract and format all predicates in a query from three parts: special operators, query clauses, and logical relationships.

1) Special operators. To utilize all operators as partition tree split conditions, certain special operators, including set operators and pattern matching operators, must be formatted into unified comparison operators. As shown in Fig.3③, 'c BETWEEN 1 AND 2' is converted into two predicates: (c, ≥, 1) and (c, ≤, 2); 'c IN [1,2]' and 'c NOT IN [1,2]' are easily converted to (c, =, [1,2]) and (c, ≠, [1,2]); if c is a common enumeration column (e.g., gender), 'c LIKE "*mal*"' is converted to (c, =, [Female, Male]) by identifying all column values that satisfy the given wildcard.

2) Query clauses. Fig.3④ designs the formatting rules for three common clauses. In cases involving a Join operation (e.g., two tables $E_1 \bowtie E_2$) where the tables share a common join key c, we can distribute all predicate conditions related to c to each table to identify potential predicates. For Group By and Order By operations on the c column, the column data needs to participate in comparison operators such as '≤' and '≥' to complete the grouping and sorting of c.

3) Logical relationship (shown in Fig.3⑤). We consider the logical relationships (AND, OR) among single predicates or predicate groups, which is crucial for accurately encoding query vectors and consequently facilitating the search for the queried leaf nodes in partition trees.
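The special-operator rewrites can be sketched as small helpers that emit (column, op, value) triplets. This is a hedged illustration of the rules in Fig.3③, not the paper's code; the triplet spellings and the lowercase-wildcard assumption for the LIKE expansion are ours:

```python
import fnmatch

def format_between(col, lo, hi):
    """'c BETWEEN lo AND hi' becomes two comparison predicates."""
    return [(col, ">=", lo), (col, "<=", hi)]

def format_in(col, values, negated=False):
    """'c IN [...]' / 'c NOT IN [...]' become set-equality triplets."""
    return [(col, "!=" if negated else "=", list(values))]

def format_like(col, pattern, enum_values):
    """For an enumeration column, expand the wildcard to matching values."""
    matched = [v for v in enum_values if fnmatch.fnmatch(v.lower(), pattern)]
    return [(col, "=", matched)]
```

For example, `format_like("gender", "*mal*", ["Female", "Male", "Other"])` selects both Female and Male, mirroring the gender example in the text.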

Query embedding. For any query, extracting all queried rows from the sampling table so as to calculate its vector representation is expensive; instead, we propose a lightweight query embedding strategy that allows obtaining its vector representation without query pre-execution. This strategy consists of two steps (see Fig.3⑥):

1) Query encoding. Using extracted predicates and logical relationships, we construct a logical tree structure ($T_{lg}$) (see Fig.3⑦) composed of logical nodes ($V_r$) and predicate nodes ($V_p$). Child nodes (including $V_p$ and sub-trees) sharing the same parent node ($V_r$) imply that they satisfy the corresponding logical relationship (AND, OR). Subsequently, all condition values related to non-numeric columns in $V_p$ are encoded using the predefined data encoding schemes.

2) Logical computation. We leverage the encoded Tlg for efficient vector calculations, which are conducted via a bottom-up traversal on Tlg, as depicted in Fig.3⑦. When traversing to each node level, all sibling nodes along with their parent node are grouped into a subtree, and then column domains are computed by performing logical operations among predicates. The process begins with the lowest level subtree (depicted by dashed triangles), sequentially visiting subtrees at higher levels, and cumulatively updating domain values until reaching the root node.

Example 3 As shown in Fig.3⑧, to embed a query q, we first extract a 3-level logical tree $T_{lg}^{q}$ and then format those predicates with non-numeric condition values or non-comparison operators using the encoding structures (Fig.3①) and predicate conversion rules (Fig.3③ and Fig.3④). Next, logical calculations are performed over $T_{lg}^{q}$ to obtain the final query vector. Specifically, the bottom subtree $T_{lg}^{q}(V_r(3){:}V_p(3{:}4))$ is executed first in terms of column c2, i.e., (c2, ≥, 1.68×10⁹) ∧ (c2, ≤, 1.71×10⁹) ⇒ c2 ∈ [1.68×10⁹, 1.71×10⁹]. Subsequently, the subtree $T_{lg}^{q}(V_r(2{:}3){:}V_p(2{:}4))$ is executed over c4. Since c4 is not referenced by the left subtree $T_{lg}^{q}(V_r(3){:}V_p(3{:}4))$, we can select c4's table domain (i.e., [1,2,3]) as its replacement, then (c4, =, [2,3]) ∧ (c4, =, [1,2,3]) ⇒ c4 ∈ [2,3]. Next, we perform $T_{lg}^{q}(V_r(1{:}2){:}V_p(1))$ over c1, i.e., (c1, <, 2) ∧ c1 ∈ [0,2] ⇒ c1 ∈ [0,2). For c3 and c5, since they are not involved in any predicates, their domains are also replaced by the corresponding table domains. Finally, we concatenate all domains to generate the final query vector.
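The bottom-up logical computation can be sketched as a recursive walk over a toy tree. This is an illustrative simplification, assuming every domain is a numeric [lo, hi] interval (so an OR is approximated by the bounding interval of the union) and the table domain fills in columns a subtree never references; the tuple encoding of nodes is ours:

```python
def eval_tree(node, table_domain):
    """node is ('AND'|'OR', children) or ('pred', col, [lo, hi])."""
    if node[0] == "pred":
        _, col, dom = node
        out = dict(table_domain)     # unreferenced columns keep table domains
        out[col] = dom
        return out
    op, children = node
    doms = [eval_tree(c, table_domain) for c in children]
    merged = {}
    for col in table_domain:
        lows = [d[col][0] for d in doms]
        highs = [d[col][1] for d in doms]
        if op == "AND":              # intersect per-column intervals
            merged[col] = [max(lows), min(highs)]
        else:                        # OR: bounding interval of the union
            merged[col] = [min(lows), max(highs)]
    return merged
```

For a tree AND(c1 ∈ [0,2], OR(c2 ∈ [1,3], c2 ∈ [5,6])) over the table domain {c1: [0,100], c2: [0,10]}, the walk yields {c1: [0,2], c2: [1,6]}, mirroring how sibling subtrees are merged level by level in Fig.3⑦.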

Query vector decoding. To create partitions over these vector predictions $H_{Q_P}$, we need to pre-decode them into predicate conditions. Considering the one-to-one mapping between $H_{Q_H}$ and $H_{Q_P}$ ($H_{Q_H} \to H_{Q_P}$), we can employ a slot-binding strategy before predicting queries. We first identify boundary values in $H_{Q_H}$ with equivalent relationships to the corresponding predicate nodes of $T_{lg}$, performing slot binding. When decoding vectors, we only fill domain values of $H_{Q_P}$ into these pre-saved slots to rewrite logical predicates.

5 Predicting similarity-load

In this section, we first define a loss function that reflects prediction effectiveness for robust partitioning. Subsequently, we employ two predictive networks to learn the similarity changes of given mappings ($H_{Q_H} \to H_{Q_F}$), generating $H_{Q_P}$.

5.1 Prediction loss incorporating partition semantics

As described in Subsection 4.1, there are three distinct domain types. To achieve a unified model input, we must process the list-type and None-type domains of non-numeric columns (ci) to match the dimensionality of the min–max domain. This ensures consistency in the dimensions of all domains within the query vector.

Vector preprocessing. Before defining the loss function, we process query vectors as follows:

1) If $H_{Q_H}^{c_i}$ is a list-type domain, predicting its changes is intuitively challenging due to its larger number of elements compared to a min-max domain. In contrast, predicting only its domain boundary changes can significantly improve accuracy, with minimal errors in the query predicate features of column $c_i$. As shown in Eq. (1), we use the min-max values of all dictionary keys in $H_{Q_H}^{c_i}$ and take half the minimum interval between keys as a margin to make $c_i$'s discrete domain continuous. This works because these discrete domains are typically derived from query predicates, comprise co-accessed column values, and are therefore likely to be allocated continuous key values under the key allocation order described in Subsection 4.1:

$$H_{Q_H}^{c_i}[k_0, \ldots, k_n] \leftarrow \left[\min(k_{0..n}) - 0.5 \times |k_1 - k_0|,\ \max(k_{0..n}) + 0.5 \times |k_1 - k_0|\right], \tag{1}$$

where $k_0, \ldots, k_n$ are the keys obtained after encoding $c_i$.

2) If $H_{Q_H}^{c_i}$ is a None-type domain, as it does not contain any specific knowledge, we apply a simple numeric encoding, i.e., $H_{Q_H}^{c_i} \leftarrow [0, 0]$.
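The two preprocessing rules can be sketched together. A hedged illustration, assuming a list-type domain arrives as a sorted key list with more than two elements and that adjacent keys are the minimum interval; the function name is ours:

```python
def preprocess_domain(dom):
    """Normalize a column domain to a continuous [lo, hi] pair for the model."""
    if dom == [None, None]:              # rule 2: no knowledge, encode as zeros
        return [0.0, 0.0]
    if len(dom) > 2:                     # rule 1: list of dictionary keys, Eq. (1)
        step = abs(dom[1] - dom[0])      # minimum interval between adjacent keys
        return [min(dom) - 0.5 * step, max(dom) + 0.5 * step]
    return list(dom)                     # already a min-max domain
```

For instance, the key list [2, 3, 5] becomes the continuous interval [1.5, 5.5].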

Loss function design. We do not directly use the mean squared error (MSE) between $H_{Q_P}$ and $H_{Q_F}$ as the loss function: a small MSE implies that the two query vectors are numerically close, but this does not necessarily mean that the partitions constructed on them have similar quality. Instead, we incorporate both the MSE and partition semantic information to accurately estimate potential partition performance over $H_{Q_F}$. Specifically, LRP tailors a function $L_{wd}$ for reliable loss estimation, using a weighted deviation to account for the impact that deviations in different dimensions between $H_{Q_P}$ and $H_{Q_F}$ have on partition quality.

Fig.4 shows all possible intersection positions of $H_{q_f}$ (white rectangle) and $H_{q_p}$ (green rectangle) in a 2D table (i.e., $z = 2$), categorized by the number of edges ($N_{eg}$) of $H_{q_f}$ not covered by $H_{q_p}$. We formulate it as follows:

$$N_{eg} = 2z - \sum_{i=1}^{z} \left[f_o\big(l_i(q_f), l_i(q_p)\big) + f_o\big(u_i(q_p), u_i(q_f)\big)\right],$$

where the function $f_o(x, y) = \begin{cases} 1, & x \ge y, \\ 0, & \text{otherwise}. \end{cases}$

As Neg increases from 0 to 4, the number of queried areas outside of Hqp increases. Consequently, each Hqf has a higher probability of accessing an extra number of partitions in the Hqp-based data layout, resulting in a higher loss value. Moreover, a smaller intersection area between Hqf and Hqp should also correspond to a higher loss value. Based on these two observations, Lwd is primarily composed of two parts of losses, i.e.,

$$L_{wd}(H_{q_f}, H_{q_p}) = L_{wd}^{(1)}(H_{q_f}, H_{q_p}) + L_{wd}^{(2)}(H_{q_f}, H_{q_p}) = \frac{\phi(H_{q_p}) - \phi(H_{q_p}, H_{q_f})}{\phi(H_{q_p})} + \bar{w}_2 \cdot \frac{\phi(H_{q_f}) - \phi(H_{q_f}, H_{q_p})}{\phi(H_{q_f})},$$

where $L_{wd}^{(1)}(H_{q_f}, H_{q_p})$ indicates the prediction inaccuracy of $q_p$, which affects the created partition size. Conversely, $L_{wd}^{(2)}(H_{q_f}, H_{q_p})$ represents the uncovered area of $H_{q_f}$, which determines $N_{eg}$ as well as the number of partitions accessed in the future. Intuitively, $L_{wd}^{(2)}$ should receive a higher weight, since increasing the number of accessed partitions affects the overall cost more than the size of accessed partitions. We therefore weight them $1 : \bar{w}_2$ ($\bar{w}_2 > 1$). The weight $\bar{w}_2$ is dynamic, driven by $N_{eg}$ and by the ratio $\gamma(q_p)$ of the query density of $q_p$ (denoted $\rho_{q_p}$) to the table's average query density (denoted $\rho_D$). A higher $\gamma(q_p)$ typically means that, under the same $N_{eg}$, more partitions will be accessed in the future. We thus set $\bar{w}_2 = 1 + \lambda \times \gamma(q_p) \times N_{eg}$, where $\lambda$ is a hyper-parameter and $\gamma(q_p)$ is calculated as follows:

$$\gamma(q_p) = \frac{\rho_{q_p}}{\rho_D} = \frac{\sum_{q_i \in Q_F}\phi(H_{q_p}, H_{q_i})}{\prod_{j=1}^{z}\big(u_j(q_p) - l_j(q_p)\big)} \times \left(\frac{\sum_{q_i \in Q_F}\phi(H_{q_i})}{\prod_{j=1}^{z}(u_j - l_j)}\right)^{-1}.$$

In this formula, $\rho_{q_p}$ is the proportion of the cumulative query-access area within the area of $q_p$, and $\rho_D$ is estimated as the proportion of the cumulative access area of all queries to the total table area.
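Assuming $\phi(\cdot)$ measures box volume and $\phi(\cdot,\cdot)$ intersection volume (consistent with how the loss is described, though the paper does not spell $\phi$ out), the weighted-deviation loss can be sketched as:

```python
def area(box):
    """phi(H): volume of a z-dimensional box [(l1, u1), ..., (lz, uz)]."""
    v = 1.0
    for l, u in box:
        v *= max(0.0, u - l)
    return v

def inter(a, b):
    """phi(Ha, Hb): volume of the intersection of two boxes."""
    return area([(max(la, lb), min(ua, ub))
                 for (la, ua), (lb, ub) in zip(a, b)])

def l_wd(qf, qp, gamma, lam=1.0):
    """Weighted-deviation loss: uncovered fraction of the prediction qp
    plus the weighted uncovered fraction of the future query qf, with
    weight w2 = 1 + lam * gamma * N_eg.  gamma stands for gamma(q_p)."""
    z = len(qf)
    neg = 2 * z - sum((1 if qf[i][0] >= qp[i][0] else 0) +
                      (1 if qp[i][1] >= qf[i][1] else 0) for i in range(z))
    w2 = 1.0 + lam * gamma * neg
    l1 = (area(qp) - inter(qp, qf)) / area(qp)        # oversized prediction
    l2 = w2 * (area(qf) - inter(qf, qp)) / area(qf)   # uncovered future area
    return l1 + l2
```

A perfect prediction gives zero loss; a prediction that merely over-covers the future query is penalized only through $L_{wd}^{(1)}$.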

5.2 Design of load predictor

In this subsection, we first categorize historical query vectors based on their temporal information, and then design suitable prediction networks for each category.

Load classification. We identify query submission times from logs. If $H_{QH}$ is distributed across continuous time intervals, and at each interval there exists a query sequence that satisfies $\delta$-similarity with its adjacent sequences, the load is referred to as ordered; otherwise, it is unordered, i.e.,

$$H_{QH} = \begin{cases}[H_{Q_h}(t_1), \ldots, H_{Q_h}(t_m)], & H_{QH}\ \text{is ordered},\\ [H_{q_h}(1), \ldots, H_{q_h}(m)], & \text{otherwise},\end{cases}$$

where $H_{Q_h}(t_i) = \{H_{q_h}(t_i, j)\ |\ j = 1, \ldots, n\}$, $i = 1, \ldots, m$.

Prediction dimensionality reduction. To facilitate model convergence, we shift from predicting the change of each domain boundary to predicting an expansion ratio $\tau_i$ ($i = 1, \ldots, z$) for each domain radius. This halves the model output dimension from $2z$ to $z$. During training, this shift increases $L_{wd}^{(1)}$ but decreases $N_{eg}$. Nevertheless, since $N_{eg}$ plays the dominant role in reducing $L_{wd}^{(2)}$ while $L_{wd}^{(1)}$ carries a smaller weight, the impact on the final $L_{wd}$ value is limited. Overall, by reducing the prediction dimensionality, this shift also stabilizes the decrease of $N_{eg}$, thereby promoting a greater decrease in total loss on the test set.

Model structure. Next, we introduce the load predictor, showcased in Fig.5, which fits the query changes in the training samples $H_{QH} \rightarrow H_{QF}$, taking $H_{QH}$ as input and generating the output $H_{QP}$.

[Step 1 Normalization process.] Because domain ranges differ significantly across columns, we first normalize the structured vector $H_{q_h}$ to $[0, 1]$ intervals based on the table domains $H_D$, as follows:

$$H_{q_h} = \mathrm{Norm}(H_{q_h}, H_D) = \bigcup_{i=1}^{z}\left[\frac{l_i(q_h) - l_i}{u_i - l_i},\ \frac{u_i(q_h) - l_i}{u_i - l_i}\right].$$

[Step 2 Estimation process.] Next, for ordered loads, to learn dependency relationships among continuous sequences $[H_{Q_h}(t-1), H_{Q_h}(t), H_{Q_h}(t+1)]$, we utilize an LSTM network [17] comprising multiple hidden layers and memory cells, each containing input, forget, and output gates. This enables the network to adaptively store and update long-term dependency information about query domain changes. For unordered loads, given the one-to-one mappings $H_{q_h}(i) \rightarrow H_{q_f}(i)$, we directly employ a simple MLP network [16] composed of stacked linear layers with ReLU activations to fit the query changing trends. Finally, a fully-connected (FC) layer maps the obtained prediction features ($O(t+n)$ or $O_3(i)$) to an expansion ratio $\tau$, and a Sigmoid activation transforms the output range to $[0, 1]$. This is fed into the scale layer to extend $H_{q_h}$, yielding

$$H_{q_p} = \mathrm{Scale}(H_{q_h}, \tau) = \bigcup_{i=1}^{z}\left[l_i(q_h) - r_i\tau_i,\ u_i(q_h) + r_i\tau_i\right],$$

where $r_i = \frac{1}{2}\big[u_i(q_h) - l_i(q_h)\big]$ is the domain radius.
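Steps 1 and 2 around the network reduce to two small transformations. The sketch below assumes $r_i$ is half the query's domain width (the "domain radius" mentioned in the dimensionality-reduction paragraph); all names are illustrative:

```python
def norm(qh, table_domains):
    """Step 1: normalize each query bound into [0, 1] using the table
    domains H_D = [(l1, u1), ..., (lz, uz)]."""
    return [((lq - l) / (u - l), (uq - l) / (u - l))
            for (lq, uq), (l, u) in zip(qh, table_domains)]

def scale(qh, tau):
    """Step 2 output: expand each domain by its radius times the predicted
    ratio tau_i in [0, 1], i.e., [l - r*t, u + r*t] with r = (u - l) / 2."""
    return [(l - 0.5 * (u - l) * t, u + 0.5 * (u - l) * t)
            for (l, u), t in zip(qh, tau)]
```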

[Step 3 Model training.] The model aggregates the features of the query vectors through non-linear transformations and iteratively updates the network weights $\Theta$ to minimize the loss $L_{wd}$ between $H_{QF}$ and $H_{QP}$. We use L2 regularization [33] to prevent overfitting. Given the optimal network weights $\bar{\Theta}$, we obtain $H_{q_p}(i)$ via $f_{\mathrm{MLP}}(H_{q_h}(i) \in H_{QH}; \bar{\Theta})$ or $f_{\mathrm{LSTM}}([H_{Q_h}(t, i)]_{t=1}^{m} \subseteq H_{QH}; \bar{\Theta})$.

6 Robust partitioning layout construction

In this section, we first present the construction process of a partition tree (a.k.a., logical partition structure) based on obtained query predictions. This process is divided into three phases: parsing, expansion, and optimization. Subsequently, we describe how this tree is integrated into the DBMS.

6.1 Parsing phase: candidate split set generation

To construct a partition tree, we pre-generate a candidate split set (denoted $S_{ss}$) before each node split. Starting from the root node, we first obtain the predicate set ($S_P$) decoded from the vector predictions $H_{QP}$. If $S_P = \varnothing$, inspired by KD-Tree [4], we derive the column median condition set ($S_M$) from the node's data distribution; otherwise, we set $S_M = \varnothing$. The two sets are then merged into the split set of the current node, i.e., $S_{ss} = \{ps \in S_P\} \cup \{ms \in S_M\}$, where $ps$ denotes a predicate-based split and $ms$ a median-based split. At each split, exactly one element of $S_{ss}$ is selected as the split condition. After splitting a node, we reallocate the remaining predicates to its child nodes, an operation denoted $\mathrm{Reallocate}(S_{ss})$: we filter the feasible predicates for each child node, checking whether each predicate can split the child into nodes that satisfy the size constraints, and then update the median conditions for each child.

The use of split conditions. Given a node $V_i$ and the split condition $s_j(\mu, op, \nu)$, we denote the split operation as $\mathrm{Split}(V_i, s_j)$. If the split column $\mu$ is of (encoded) numeric type, we evaluate the expression '$\mu\ op\ \nu$' (where $op \in \{=, \neq, >, <, \geq, \leq\}$) to split the node data: rows that satisfy the condition go to the left child node, the rest to the right. If $\mu$ is an unencoded text type, the expression '$\mu\ op\ \nu$' is an invalid split and is treated as a special median split, i.e., we randomly split the node data and distribute it evenly between the two child nodes.
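A minimal sketch of $\mathrm{Split}(V_i, s_j)$ for encoded numeric columns follows; rows as dicts and the operator table are our own simplifications (the text-column fallback would simply shuffle rows into two equal halves):

```python
import operator

# Hypothetical operator table for evaluating 'mu op nu'.
OPS = {'=': operator.eq, '!=': operator.ne, '>': operator.gt,
       '<': operator.lt, '>=': operator.ge, '<=': operator.le}

def split_node(rows, col, op, val):
    """Rows satisfying 'col op val' go to the left child, the rest to the
    right, mirroring the split rule described above."""
    test = OPS[op]
    left = [r for r in rows if test(r[col], val)]
    right = [r for r in rows if not test(r[col], val)]
    return left, right
```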

6.2 Expansion phase: beam-search tree split

During the entire tree extension, we aim to find an optimal split order by assigning appropriate split conditions to nodes at each depth, as the current depth’s condition selection impacts both the next depth’s decision and the overall split order of the tablespace.

Basic idea. We adopt a refined greedy policy, beam search (BS), which first explores local split-condition combinations across multiple tree depths at each step, and then selects the current depth's split condition from the combination with the highest skipping benefit $B_{skip}$. Here, $B_{skip}$ represents the reduction in query cost before and after splitting. The process involves two important parameters: (1) $n_c$, the number of split-condition combinations explored in each decision-making step; and (2) $h_{max}$, the number of split conditions in each combination, i.e., the maximal additional tree depth explored at the current node.

Beam search-based tree split. Algorithm 1 provides the execution details of BS. We first traverse each available leaf node $V_i$ (initially only the root node) of the partition tree $T$ in turn (line 4). We then generate a dynamically changing $S_{ss}$ during the node split process (lines 2, 5, 16) and sort $S_{ss}$ by $B_{skip}$ (line 5). Subsequently, the top-$n_c$ items in $S_{ss}$ are selected as exploration conditions, yielding $n_c$ distinct candidate split paths for $V_i$ (lines 7–13). Each path recursively executes the same beam-search subprocess (denoted $BS^{(2)}$) to return the query cost ($C_{split}$) of accessing the current child nodes; each $BS^{(2)}$ continues to extend secondary paths (i.e., $BS^{(3)}$) until $BS^{(h_{max})}$ is executed. Next, we choose the current split condition ($s_{min}$) from the optimal path among the $n_c$ paths, split $V_i$ into child nodes $V_l$ and $V_r$, and update their $S_{ss}$ (lines 14–17). After splitting, we proceed to the next available leaf node and repeat this process. The algorithm terminates when no more leaf nodes can be split (line 3).
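The essence of this selection step can be sketched as follows. We phrase the skipping benefit as a cost to minimize, and `cost_fn` (returning a split's cost together with the two child nodes), `nc`, and `hmax` are illustrative stand-ins for $B_{skip}$, $n_c$, and $h_{max}$, not LRP's actual Algorithm 1 interfaces:

```python
def best_split(node, candidates, cost_fn, nc=2, hmax=2):
    """Pick the split condition for `node`: rank candidates, keep the top
    nc, recursively extend each path up to hmax levels, and commit to the
    first condition of the cheapest path."""
    def path_cost(n, conds, depth):
        # cost of the best condition combination below `n`, exploring at
        # most `depth` further levels and `nc` conditions per level
        if depth == 0 or not conds:
            return 0.0
        best = float('inf')
        for c in sorted(conds, key=lambda c: cost_fn(n, c)[0])[:nc]:
            cost, left, right = cost_fn(n, c)
            rest = [x for x in conds if x != c]
            best = min(best, cost + path_cost(left, rest, depth - 1)
                             + path_cost(right, rest, depth - 1))
        return best

    scored = []
    for c in sorted(candidates, key=lambda c: cost_fn(node, c)[0])[:nc]:
        cost, left, right = cost_fn(node, c)
        rest = [x for x in candidates if x != c]
        scored.append((cost + path_cost(left, rest, hmax - 1)
                            + path_cost(right, rest, hmax - 1), c))
    return min(scored, key=lambda s: s[0])[1]
```

With `hmax=1` this degenerates to a plain one-step greedy policy; larger `hmax` trades construction time for better split orders.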

6.3 Optimization phase: node shape refinement via data replication

In horizontal partitioning, row data acts as the minimal construction unit of partition files. This can result in significant query access contention between leaf nodes, especially for dense queries, where different parts of each row are likely to be accessed by different queries. To mitigate the contention, LRP introduces data replication for frequently accessed rows, thereby creating super nodes, which are defined as follows:

Definition 3 (Super Node). Leaf nodes maintain distinct data domains to ensure each row is routed to only one partition. However, when a small batch of rows from one partition is copied into another, this rule is broken, and the modified partition becomes a redundant partition. We refer to its corresponding tree node as a super node, which preserves a primary domain and multiple secondary domains to route non-redundant and redundant data independently.

We denote the replication process of data from node $V_i$ to $V_j$ as $V_i \rightarrow V_j$, where $V_j$ is the super node and is assigned an additional secondary domain to maintain the redundant data. When more data is replicated from $V_i$ to $V_j$, any intersecting secondary domains are merged to avoid storing repeated redundant data. When searching for leaf nodes that match a given query, the primary domain continues to function as the main router; for super nodes, however, the secondary domains are also exploited, thereby minimizing the number of leaf-node reads.

Example 4 Given a two-dimensional table and a decoded query set $Q_P$ ($q_{1-3}$), Fig.6(a) displays a partition tree $T$ expanded using the split order $\{ps_1, ps_2, ms_4, ps_3, ms_5\}$, where $ps_{1-3}$ and $ms_{4-5}$ represent predicate and median splits extracted from $Q_P$ and the node dataset, respectively. The final $T$ contains six leaf nodes, which are instantiated in the tablespace as six partitions $R_{5-8,10,11}$ (see Fig.6(b)). Based on the primary domains of the partitions, $q_1$ needs to access $\{R_8, R_{10}\}$. If we replicate the green-shaded data area $S_1$ from $R_8$ to $R_{10}$ (i.e., $V_8 \rightarrow V_{10}$) and add the data domain of $S_1$ as a secondary domain of $R_{10}$ (see Fig.6(c)), then $q_1$ only needs to access $R_{10}$, because the data $q_1$ scans in $R_8$ satisfies the secondary domain of $R_{10}$. Similarly, when creating $R_{10} \rightarrow R_6$, the access plan for $q_2$ changes from $\{R_6, R_{10}\}$ to $\{R_6\}$.

Replicate node creation. We stipulate that a node is considered for replication only if the ratio of redundant data to the entire node's data is below a set threshold $\theta$. Generally, the more data is replicated, the greater the data-skipping benefit $B_{skip}$. However, if replicating an equal amount of data does not yield a proportional increase in $B_{skip}$ (e.g., when the super node serves only a few queries), the replication is not recommended: it incurs high data maintenance and storage costs for minimal $B_{skip}$. To address this trade-off, we define a function $y = \digamma(\theta)$ representing the change in the ratio of skipping benefit to redundant-data proportion as $\theta$ increases, i.e.,

$$y = \digamma(\theta) = \frac{w_s \times B_{skip}}{w_r \times \frac{|D_r(\theta)|}{|D|}},$$

where $w_s$ and $w_r$ are user-defined weights for $B_{skip}$ and $\frac{|D_r(\theta)|}{|D|}$, respectively (we set $w_s : w_r = 1 : 1$); $D_r(\theta)$ is all redundant data when the replication threshold is $\theta$; and $D$ denotes the table data. Since we aim to achieve greater skipping benefits with as low a $\theta$ as possible, we search for the inflection point of the $y$-curve's growth as the optimal $\bar{\theta}$, i.e., we solve $\digamma''(\theta) = \frac{d^2 y}{d\theta^2} = 0$. This signifies that when $\theta > \bar{\theta}$, increasing $D_r$ by the same proportion yields only limited growth in $B_{skip}$.
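In practice $\digamma$ is only observable at sampled thresholds, so $\bar{\theta}$ can be approximated with discrete second differences. This numeric sketch assumes an evenly spaced, sorted grid, and `F` stands in for $\digamma$:

```python
def optimal_theta(F, thetas):
    """Approximate the inflection point of y = F(theta): evaluate F on the
    grid, form central second differences, and return the interior grid
    point whose curvature is closest to zero (F'' ~ 0)."""
    h = thetas[1] - thetas[0]
    ys = [F(t) for t in thetas]
    # discrete second derivative at interior grid points
    d2 = [(ys[i - 1] - 2 * ys[i] + ys[i + 1]) / h ** 2
          for i in range(1, len(ys) - 1)]
    i = min(range(len(d2)), key=lambda k: abs(d2[k]))
    return thetas[i + 1]
```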

Super node selection. We categorize the visited nodes into two groups: replicate nodes ($V_X$) holding data to be replicated, and candidate nodes ($V_Y$) without such redundant data. For each query, after deciding the replicate nodes (satisfying $\bar{\theta}$), we must select which candidate node will serve as the super node. Generally, to maximize $B_{skip}$, the $V_Y$ with the fewest query accesses is preferred as the super node, avoiding additional reads of redundant data by irrelevant queries. Therefore, for each pair of nodes $V_X$ and $V_Y$, we define a benefit $B_{allocate}$ for $V_X \rightarrow V_Y$ to reflect its allocation priority. $B_{allocate}$ is the skipped querying node size after replication, minus the additional read size of redundant data for other, irrelevant queries, that is

$$B_{allocate}(V_X \rightarrow V_Y) = \big(|V_X| - |\tilde{D}_r(V_Y)|\big) \times |Q_{V_X \rightarrow V_Y}| - |D_r^{V_X \rightarrow V_Y}| \times O(V_Y),$$

where $V_X$ is the node skipped by all queries $Q_{V_X \rightarrow V_Y}$ that benefit from creating $V_X \rightarrow V_Y$; $\tilde{D}_r(V_Y)$ is the redundant data already in $V_Y$ before allocation; $D_r^{V_X \rightarrow V_Y}$ is the data replicated from $V_X$ to $V_Y$; and $O(V_Y)$ is the number of queries accessing $V_Y$.

Next, we select the element with the highest benefit, $\bar{V}_Y$, as the super node ($V_X \rightarrow \bar{V}_Y$), updating its $\tilde{D}_r(\bar{V}_Y)$ accordingly, i.e., $\tilde{D}_r(\bar{V}_Y) \mathrel{+}= D_r^{V_X \rightarrow \bar{V}_Y}$. Each $\bar{V}_Y$ can be allocated one or more times. This allocation process repeats until no positive benefit can be obtained.
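This selection amounts to a greedy loop that stops at the first non-positive benefit. In the sketch below, the `benefit` callback stands in for evaluating $B_{allocate}$ given the allocations made so far; all names are illustrative:

```python
def allocate_super_nodes(pairs, benefit):
    """Repeatedly pick the replication pair (vx, vy) with the highest
    positive benefit, record it, and re-score the remaining pairs (their
    benefits change as redundant data accumulates in super nodes)."""
    allocated, remaining = [], list(pairs)
    while remaining:
        scored = [(benefit(vx, vy, allocated), (vx, vy))
                  for vx, vy in remaining]
        best_b, best_pair = max(scored, key=lambda s: s[0])
        if best_b <= 0:          # no positive benefit left: stop
            break
        allocated.append(best_pair)
        remaining.remove(best_pair)
    return allocated
```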

6.4 Functionality of the LRP partition tree

The LRP partition tree $T$ serves two primary functions: (1) Routing support: each row in the raw table is processed, matched against leaf nodes via the node metadata of $T$, and subsequently written to the corresponding partition files (a.k.a., physical partitions). (2) Partition filtering: when a new query is submitted, its encoded logical tree $T_{lg}$ is executed across all relevant partition trees. The logical predicates at each level of $T_{lg}$ are examined from bottom to top to identify the satisfied leaf nodes of $T$. This limits the search space of partition files and assists the query optimizer in formulating an efficient plan.

7 Efficiency analysis of LRP design

In this section, we analyze the execution efficiency of four phases within LRP and compare it with two classical partitioning algorithms (QdTree [8], PAW [9]). The results are shown in Tab.2.

1) Data and query feature extraction: QdTree incurs minimal overhead during predicate feature extraction due to the absence of a query prediction module, thus eliminating the need for query vectorization. In contrast, PAW requires traversing the entire sampling table to pre-execute queries for generating query vectors. LRP, however, only encodes non-numeric columns and performs logical computations, resulting in lower costs compared to PAW.

2) Query prediction: The training time for LRP’s prediction module depends on the number of epochs, introducing additional costs due to weight updates across stacked network layers. Conversely, PAW is faster as it only necessitates multiple iterations of heuristic determinations to incrementally identify the optimal τ.

3) Partition tree construction: The selection of candidate predicates during each tree split is the primary time bottleneck. Both QdTree and PAW employ a similar greedy strategy. LRP, however, explores a larger solution space through beam search in each predicate-selection round, potentially leading to exponential growth in time complexity if $h_{max}$ is not properly set.

4) Data routing: All methods share the same partition instantiation process. To deploy partitions, all row data must be routed through the partition tree and written to specified block addresses. Here, we primarily considered the volume of data written and network factors, without accounting for finer-grained factors such as cache hit rate.

Overall, both PAW and LRP involve query encoding and prediction, leading to higher execution overhead than QdTree. Comparing LRP and PAW, LRP incurs lower costs only during the feature extraction phase, and its overall overhead is greater. However, as illustrated in Tab.2②③, this issue can be partially mitigated by setting appropriate hyperparameters (e.g., keeping $h_{max}$ as small as possible, removing redundant intermediate network layers to reduce $N_{net}$, etc.) or by building different partition sub-trees in parallel with multithreading. Furthermore, as shown in Tab.2④, when routing large-scale datasets to create partition files (i.e., $N_{row} \gg N_{samples}$), the time-scale differences among the algorithms have a smaller impact on overall system performance.

8 Experiments

In this section, we present the evaluation results of the LRP system from three aspects: (1) we compare the scanning ratios of LRP (including its variants) with two classical partitioning algorithms (QdTree, PAW) and the optimal reference on raw and numeric tables under static loads; (2) we separately evaluate the table scanning ratios of different prediction methods combined with the same or different partitioners under $\delta$-similarity loads, and explore the impact of different mixing ratios of exploratory and similarity queries on partitioner performance; (3) we further evaluate the above experimental configurations in a real Spark cluster.

8.1 Experimental setup

We conduct extensive experiments on a Spark cluster with four computer nodes (3.8 GHz CPUs, 32 GB RAM, 1 TB disk), utilizing HDFS [18] as the underlying file system and Parquet files [34] to store partitioned block data. To speed up data routing, a distributed acceleration framework Ray [35] is utilized to fully leverage the computational resources of each machine. The neural networks (prediction module) are trained on a Tesla P40 GPU with a 24 GB frame buffer.

Dataset and static workload. (1) TPC-H benchmark [15] (50 GB data) comprises 8 tables (62 columns in total) and 22 query templates. (2) TPC-DS [36] (62 GB data) simulates the sales operations of a department store. Compared to TPC-H, it offers a more comprehensive schema, including 7 fact tables and 14 dimension tables (429 columns in total), as well as 99 query templates. (3) JOB benchmark [37] is a 13 GB real-world IMDB dataset containing 12 tables (134 columns in total) and 33 query templates. (4) ClickBench [38] (20 GB data) is derived from an actual traffic platform, featuring one large table (105 columns, 100M records) and 43 query templates. Here, query templates are used to generate an arbitrary number of static synthetic queries.

Similarity workload. Following PAW [9], we collect 4250 static queries as $Q_H$ for TPC-H and JOB, and generate the mappings $Q_H \rightarrow Q_F$, named TPC-D and JOB-D, respectively. Specifically, we divide each tablespace into $2z$ regions and extend each domain of $Q_H$ with a random distance threshold $\delta \sim \mathrm{uniform}(0.8\Gamma, \Gamma)$, where $\Gamma = \lambda(1 + \frac{i}{2z})$, $\lambda = 0.01$, and $i$ indicates that $Q_H$ falls within the $i$th region. For the ordered queries, we set $\lambda \in [0.005, 0.015]$ with $\lambda \propto t$. For each benchmark, we randomly shuffle the $\delta$-similar queries and split them into training/validation/test sets at a 7:2:1 ratio.
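A sketch of how such $\delta$-similar future queries might be generated from historical ones; the region indexing and the scaling of $\delta$ by each column's domain width are our own simplifications of the setup described above:

```python
import random

def delta_similar(qh, table_domains, regions=4, lam=0.01):
    """Extend each domain of a historical query by a random threshold
    delta ~ uniform(0.8 * Gamma, Gamma), Gamma = lam * (1 + i / (2 z)),
    where i is the (here randomly drawn) region index of the query."""
    z = len(qh)
    i = random.randrange(regions)            # illustrative region index
    gamma = lam * (1 + i / (2 * z))
    out = []
    for (l, u), (dl, du) in zip(qh, table_domains):
        delta = random.uniform(0.8 * gamma, gamma) * (du - dl)
        # extend both bounds, clamped to the table domain
        out.append((max(dl, l - delta), min(du, u + delta)))
    return out
```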

Evaluation metrics. (1) Access Ratio indicates the ratio of accessed data to the entire table size. (2) Query Latency refers to the end-to-end response time for processing queries. (3) Model Execution Time denotes the time taken to generate logical partition structures, including query parsing, query prediction, and partition tree creation.

Baseline methods.

● QdTree [8] generates the candidate split set using only numeric predicate conditions during the creation of the partition tree. Its split policy consistently selects the predicate condition that maximizes data skipping benefit at each extension step.

● PAW [9] optimizes the division of small nodes in QdTree by merging multiple steps of predicate conditions into a single-step split condition.

● LB-Cost represents a theoretical lower bound cost in an ideal scenario where all accessed data per query is searched and directly written to separate block files.

● TH+ is a refined version of PAW [9]'s prediction module. It selects the optimal distance threshold $\bar{\delta}$ from all possible values for extending domain boundaries in $Q_H$, aiming to minimize the MSE between $Q_P$ and $Q_F$.

● ML+ denotes the independent MLP or LSTM predictor component of LRP, aiming to minimize the access ratio when executing $Q_F$ over $Q_P$-based partitions.

● LRP-E and LRP-S are two variants of LRP, each with a component removed to conduct an ablation study. They remove the column encoding and data redundancy modules from LRP, respectively.

8.2 Exp-1: Static scenario experiments

Evaluation on pure numeric tables. We first conduct experiments on pure numeric tables, i.e., with all non-numeric column data removed. Fig.7(a) reveals that LRP-S and LRP-E achieve over 8% and 15.7% reductions in table access ratio compared to PAW across all benchmarks, respectively. Meanwhile, Fig.8 provides detailed comparisons for representative tables in each benchmark. All cases follow the same performance ranking: LB-Cost > LRP-E (LRP) > LRP-S > PAW > QdTree, where LRP is equivalent to LRP-E because the encoding component is inactive when non-numeric column data is excluded from the raw tables. These results have three-fold reasons: (1) LRP-S outperforms PAW, demonstrating that our beam-search policy identifies better split orders among numerous condition combinations; (2) LRP-E outperforms LRP-S, indicating that super nodes help resolve the query access contention caused by row-granularity limitations when handling high-density queries on tables such as D_wb_sales and D_store_sales; (3) the performance differences between the algorithms are stable and do not vary significantly across datasets or numerical predicates.

Evaluation on raw tables. When experimenting on raw tables, Fig.7(b) reveals larger performance discrepancies among the algorithms compared to the numeric environments shown in Fig.7(a). Detailed comparisons for specific tables are provided in Fig.9. LRP-S and LRP reduce the data access ratio by 38.9% and 48.8%, respectively, compared to PAW. This reduction is especially pronounced in tables with a high proportion of predicates, such as H_part, D_item, and J_name (average 64%). This is because when more textual predicates (especially those involving costly join operations) are mixed with numeric predicates, LRP and LRP-S are still able to effectively capture the complete query access patterns. In contrast, QdTree and PAW perform worse due to their incomplete query information, rendering numeric predicate-based splits inefficient. This inefficiency is also reflected in their similar table scanning ratios in cases like H_part, D_customer, and J_name, where a large proportion of text predicates are present. Moreover, the result that LRP outperforms LRP-S indicates that LRP’s allowance for data redundancy among partitions further enhances performance on raw tables. These factors contribute to LRP exhibiting more significant reductions in scanning data compared to the other baselines.

Evaluation on model overhead. For the raw tables, Fig.10(a) evaluates the model execution time of each algorithm and their average data routing time when deploying physical partitions. Although LRP is 17.87 s slower than PAW in logical partition generation due to its additional data encoding and redundancy modules, the cumulative processing time (163 s) is far lower than the data routing time (4.26 h), so this gap can be considered negligible. Fig.10(b) displays the changes in the data redundancy ratio. We observe that SuperNode minimizes block addressing and scanning by replicating a small portion of data. Specifically, LRP yields an average scanning-ratio reduction of 22.7% (i.e., $\frac{18+15+23+35}{4}\%$) over LRP-S, with a data redundancy ratio of only 0.0034. This is because queries with more textual predicates are more likely to access dispersed, smaller amounts of data, thereby promoting the $B_{skip}$ of redundant partitions.

Evaluation on model adaptability. We also conduct model sensitivity tests with varying workload configurations. Fig.11(a) varies the number of generated TPC-H queries from 50 to 1000, increasing query density and requiring finer partitions; LRP's improvement over PAW starts at 6.7% and eventually reaches 29%. Fig.11(b) explores the impact of the maximum query range: a larger queried range yields higher access ratios and a larger performance gap between LRP and PAW. At 0.3 or above, however, the quality of all layouts begins to deteriorate until, at 1, all layouts become equivalent, with their queried data stored in the same number of blocks. Fig.11(c) restricts the maximum number of columns accessed by each query. As shown, the more columns involved in the predicates, the fewer rows on average meet the conditions, so a smaller proportion of data is scanned. This does not affect predicate selection during partition-tree construction and thus has minimal impact on algorithm performance. Fig.11(d) explores two extreme query distributions. Under a uniform distribution, each algorithm effectively partitions rows to accommodate each query, with LRP performing best. When queries are densely distributed, some table areas cannot be effectively partitioned due to contention among queries, reducing the performance of all algorithms; LRP nevertheless maintains the lowest access ratio, benefiting from data replication.

8.3 Exp-2: δ-similarity scenario experiments

In addition to the provided TH+ and ML+ predictors, we establish two additional prediction baselines: N/A, representing no prediction, and OPT, referring to the predicted load satisfying $Q_P = Q_F$.

Evaluation on prediction approaches. Fig.12 displays the $Q_P$-based layout performance for various prediction methods, along with the performance deviation (shaded areas) across single-table layouts. Regardless of whether the same partitioner (e.g., LRP) or distinct partitioners (i.e., QdTree, PAW, LRP, and LB-Cost) are used to create layouts based on the given predictions, our ML+ predictor always outperforms the best alternative, achieving performance improvements of 46% and 72% compared to TH+ in unordered and ordered loads, respectively. Additionally, ML+ shows narrower shaded boundaries, indicating better optimization for each sub-layout. Moreover, compared to ordered loads, TH+ proves less effective in unordered loads with fewer historical queries, while ML+ exhibits greater stability.

Case study. Fig.13(a)–Fig.13(c) and Fig.13(d)–Fig.13(f) present 3D visualizations of query prediction cases from TPC-D and JOB-D, respectively, across dimensions $c_1$, $c_2$, and $c_3$. In Fig.13(a) and Fig.13(d), a close-up of a randomly selected query is shown, with the coordinate system adjusted for clarity. Blue and red boxes represent the $H_{QP}$ generated by TH+ and ML+, respectively, while the green boxes depict the actual $H_{QF}$. In Fig.13(a), although TH+'s $H_{QP}$ aligns more closely with $H_{QF}$ than ML+'s, i.e., $L_{mse}(\mathrm{TH+}) < L_{mse}(\mathrm{ML+})$, ML+ achieves a smaller prediction loss ($L_{wd}(\mathrm{ML+}) < L_{wd}(\mathrm{TH+})$) because its $N_{eg}$ value is 0. Intuitively, if the red box is used as the basis for constructing partition files, then a query that scans the green box needs to access only one file to complete its task. Similarly, in Fig.13(d), the $N_{eg}$ of ML+ is 0, lower than TH+'s 4. This suggests that ML+ predicts domain values slightly larger than the real $H_{QF}$, making robust partitioning feasible.

Fig.13(b)–Fig.13(c) and Fig.13(e)–Fig.13(f) randomly select six and eight query cases, respectively, with their prediction metrics compared in Tab.3. In all cases, partitions based on ML+'s predictions consistently yield fewer scanned blocks (ML+ : TH+ = 45 < 198) and lower data-scan ratios (0.0045 < 0.018). This can be attributed to ML+'s smaller $N_{eg}$ (0.36 < 4.86), leading to a smaller $L_{wd}$ value (0.79 < 8.45), irrespective of the $L_{mse}$ value ($3\times10^{-4} > 9\times10^{-5}$).

Evaluation on the exploratory queries. Fig.14 shows that LRP mitigates performance degradation when more exploratory queries are mixed into the TPC-D and JOB-D test queries. As the random percentage increases from 0% to 85%, the access-ratio gap between PAW and QdTree expands from 2.2 to 13.1 in TPC-D, and from 1.4 to 10.8 in JOB-D. Similarly, the gap between LRP and PAW expands from 1.2 to 9 in TPC-D and from 0.5 to 3.5 in JOB-D. This improvement is attributed to PAW/LRP's use of data-distribution-based median splits, which makes their layouts more adaptable to exploratory queries than QdTree's sole use of predicate splits. Moreover, LRP's consideration of median splits for encoded non-numeric column data further boosts its adaptability.

8.4 Exp-3: efficiency analysis in Spark cluster

There is a positive correlation between the access ratio and query latency. Reducing the data access ratio typically results in decreased query latency due to fewer accessed partition files, less scanned metadata and column data on disk, and less frequent merging of scan results.

To assess the quality of the constructed layouts, we utilized Spark-SQL, averaging five executions per indicator measurement. Tab.4 presents statistics regarding query latency, such as averages, minimums, and percentiles. Our findings are as follows. (1) Static Scenarios: For both numeric and raw datasets, LRP consistently outperforms the other baselines across the statistical indicators. The optimizations in the partition tree and the complete consideration of logical predicates let LRP achieve average query-latency reductions of 15.9% and 47.5% on the two types of datasets, respectively, compared to PAW. LRP exhibits more significant improvements on raw tables, mirroring the trends in Fig.7(b): by effectively leveraging text-type predicates as partitioning features, LRP accelerates filter operations, especially in queries with a higher proportion of non-numeric columns. (2) Dynamic Scenarios: In TPC-D and JOB-D, LRP achieves greater performance improvements than in the static scenarios, attributable to its robust layout supported by predictive networks with a novel loss function. This function allows ML+ to establish a positive correlation between prediction loss and layout robustness while minimizing prediction loss. Compared to PAW, LRP achieves query response times that are 2.49× faster (5.13 s → 2.06 s) on TPC-D and 3.17× faster (1.70 s → 0.54 s) on JOB-D.

9 Conclusion and future work

In this paper, we propose an end-to-end LRP system. LRP begins by designing a comprehensive data and query encoding scheme to extract valuable query access patterns across both numeric and non-numeric column data types. It then trains MLP/LSTM networks to minimize loss values by integrating partition semantic information, thereby predicting future query pattern shifts. Finally, it implements beam search-based tree-splitting and node redundancy policies to generate a partition tree, which can materialize a robust data layout tailored to predicted query patterns. Experimental results on Spark demonstrate that our method significantly outperforms existing solutions, achieving a 49.20% reduction in query latency for static queries and a 64.15% reduction in dynamic environments.

However, some aspects of LRP still require further refinement, which we plan to address in future work. (1) Join Optimization: When selecting tree-splitting conditions, we did not fully consider the cost of data shuffling, which often exceeds scanning costs. Therefore, predicate and median conditions related to Join columns should be weighted more heavily to reduce shuffling overhead. (2) Load Balancing: Although table data has been allocated to partition files, the distribution of these files across machines needs further optimization. Future work will focus on optimizing the physical placement of files based on access frequency and other factors to ensure load balancing and maximize the utilization of cluster resources.

References

[1]

Taylor R W, Sacca D, Wiederhold G . Database partitioning in a cluster of processors. ACM Transactions on Database Systems (TODS), 1985, 10( 1): 29–56

[2]

Copeland G, Alexander W, Boughter E, Keller T. Data placement in Bubba. In: Proceedings of 1988 ACM SIGMOD International Conference on Management of Data. 1988, 99−108

[3]

Stöhr T, Märtens H, Rahm E. Multi-dimensional database allocation for parallel data warehouses. In: Proceedings of the 26th International Conference on Very Large Data Bases. 2000, 273−284

[4]

Bentley J L . Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18( 9): 509–517

[5]

Zhan C, Su M, Wei C, Peng X, Lin L, Wang S, Chen Z, Li F, Pan Y, Zheng F, Chai C . AnalyticDB: real-time OLAP database system at Alibaba Cloud. Proceedings of the VLDB Endowment, 2019, 12( 12): 2059–2070

[6]

Papadomanolakis S, Ailamaki A. AutoPart: automating schema design for large scientific databases using data partitioning. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management. 2004, 383−392

[7]

Sun L, Franklin M J, Krishnan S, Xin R S. Fine-grained partitioning for aggressive data skipping. In: Proceedings of 2014 ACM SIGMOD International Conference on Management of Data. 2014, 1115−1126

[8]

Yang Z, Chandramouli B, Wang C, Gehrke J, Li Y, Minhas U F, Larson P Å, Kossmann D, Acharya R. Qd-tree: learning data layouts for big data analytics. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 193−208

[9]

Li Z, Yiu M L, Chan T N. PAW: data partitioning meets workload variance. In: Proceedings of the 38th IEEE International Conference on Data Engineering. 2022, 123−135

[10]

Sun L, Franklin M J, Wang J, Wu E . Skipping-oriented partitioning for columnar layouts. Proceedings of the VLDB Endowment, 2016, 10( 4): 421–432

[11]

Li C, Markl V, Aly A M, Mahmood A R, Hassan M S, Aref W G, Ouzzani M, Elmeleegy H, Qadah T . AQWA: adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, 2015, 8( 13): 2062–2073

[12]

Aly A M, Elmeleegy H, Qi Y, Aref W. Kangaroo: workload-aware processing of range data and range queries in Hadoop. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 2016, 397−406

[13]

Lu Y, Shanbhag A, Jindal A, Madden S . AdaptDB: adaptive partitioning for distributed joins. Proceedings of the VLDB Endowment, 2017, 10( 5): 589–600

[14]

Ding J, Minhas U F, Chandramouli B, Wang C, Li Y, Li Y, Kossmann D, Gehrke J, Kraska T. Instance-optimized data layouts for cloud analytics workloads. In: Proceedings of 2021 International Conference on Management of Data. 2021, 418−431

[15]

TPC-H benchmark. See tpc.org/tpch/ website, 1999.

[16]

Rosenblatt F . Principles of neurodynamics: perceptrons and the theory of brain mechanisms. The American Journal of Psychology, 1963, 76( 4): 705–707

[17]

Hochreiter S, Schmidhuber J . Long short-term memory. Neural Computation, 1997, 9( 8): 1735–1780

[18]

Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST). 2010, 1−10

[19]

Shanbhag A, Jindal A, Madden S, Quiane J, Elmore A J. A robust partitioning scheme for ad-hoc query workloads. In: Proceedings of 2017 Symposium on Cloud Computing. 2017, 229−241

[20]

Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X . TiDB: a Raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13( 12): 3072–3084

[21]

ClickHouse: an open-source columnar database management system. See clickhouse.com/docs/en/observability/managing-data website, 2016

[22]

Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J S, Lee A W, Motivala A, Munir A Q, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P. The snowflake elastic data warehouse. In: Proceedings of 2016 International Conference on Management of Data. 2016, 215−226

[23]

Moerkotte G. Small materialized aggregates: a light weight index structure for data warehousing. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 476−487

[24]

Graefe G. Fast loads and fast queries. In: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery. 2009, 111−124

[25]

Kang D, Jiang R, Blanas S. Jigsaw: a data storage and query processing engine for irregular table partitioning. In: Proceedings of 2021 International Conference on Management of Data. 2021, 898−911

[26]

Khan A, Yan X, Tao S, Anerousis N. Workload characterization and prediction in the cloud: a multiple time series approach. In: Proceedings of 2012 IEEE Network Operations and Management Symposium. 2012, 1287−1294

[27]

Pavlo A, Angulo G, Arulraj J, Lin H, Lin J, Ma L, Menon P, Mowry T C, Perron M, Quah I, Santurkar S, Tomasic A, Toor S, Van Aken D, Wang Z, Wu Y, Xian R, Zhang T. Self-driving database management systems. In: Proceedings of the 8th Biennial Conference on Innovative Data Systems Research. 2017, 1

[28]

Ma L, Van Aken D, Hefny A, Mezerhane G, Pavlo A, Gordon G J. Query-based workload forecasting for self-driving database management systems. In: Proceedings of 2018 International Conference on Management of Data. 2018, 631−645

[29]

Hilprecht B, Binnig C, Röhm U. Learning a partitioning advisor for cloud databases. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 143−157

[30]

Zhou X, Li G, Feng J, Liu L, Guo W . Grep: a graph learning based database partitioning system. Proceedings of the ACM on Management of Data, 2023, 1( 1): 94

[31]

Jindal A, Dittrich J. Relax and let the database do the partitioning online. In: Proceedings of the 5th International Workshop on Business Intelligence for the Real-Time Enterprise. 2011, 65−80

[32]

Wang J, Chai C, Liu J, Li G . Face: a normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 2021, 15( 1): 72–84

[33]

Hoerl A E, Kennard R W . Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970, 12( 1): 55–67

[34]

Melnik S, Gubarev A, Long J J, Romer G, Shivakumar S, Tolton M, Vassilakis T . Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 2010, 3( 1–2): 330–339

[35]

Ray: an open source framework to build and scale your ML and Python applications. See docs.ray.io/en/latest/ website, 2017

[36]

TPC-DS benchmark. See www.tpc.org/tpcds/ website, 2005

[37]

JOB benchmark. See developer.imdb.com/non-commercial-datasets/ website, 2016

[38]

ClickBench benchmark. See github.com/ClickHouse/ClickBench website, 2019

RIGHTS & PERMISSIONS

Higher Education Press
