OCACO: an operator-level cardinality and cost joint estimator

Tao JI; Haoyang LI; Kai ZHONG; Jing ZHANG; Cuiping LI; Hong CHEN

doi:10.1007/s11704-025-50348-6

Front. Comput. Sci., 2026, Vol. 20, Issue 9: 2009349. DOI: 10.1007/s11704-025-50348-6

Artificial Intelligence

RESEARCH ARTICLE



Abstract

Cardinality and cost estimation are critical components of query optimization, as they directly influence the construction of efficient physical execution plans. While machine learning-based estimators have achieved notable success, they face several challenges: (1) Training data derived from rigid, template-driven benchmarks diverges significantly in distribution from real-world query workloads, and manually designed templates cannot exhaustively represent the full spectrum of query patterns. (2) These methods generalize poorly, especially for sub-plan estimation or for queries that deviate significantly from the training query templates. Furthermore, the inherent inefficiency of operator-level cardinality estimation frequently limits its usefulness for accurate cost estimation. (3) These approaches frequently fail to leverage the rich semantic information and dynamic dependencies between operators.

To address these challenges, we propose a novel operator-level cardinality and cost estimator that simultaneously estimates the cardinality and cost of all sub-plans within a query plan. First, we leverage large language models to generate high-quality and diverse SQL queries, which serve as the foundation for pre-training and fine-tuning our model. Second, we introduce a semantic-based operator encoding strategy, augmented with a novel tree-structure-aware neural network, to effectively represent each sub-plan. Third, we propose a specialized loss function tailored for joint cardinality and cost prediction at the operator level, fully utilizing labels from each sub-plan. Extensive experiments on both synthetic and real-world datasets demonstrate that our method consistently outperforms state-of-the-art approaches.
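To make the idea of supervising every sub-plan concrete, the following is a minimal sketch of a joint loss over a plan tree. It is not the loss function proposed in the paper, whose exact form the abstract does not give. The `PlanNode` class, the `joint_log_loss` function, the `cost_weight` parameter, and the choice of squared log-errors are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class PlanNode:
    """Hypothetical plan operator; the subtree rooted here is one sub-plan."""
    op: str
    true_card: float   # observed output rows of this sub-plan
    true_cost: float   # observed cost of this sub-plan
    pred_card: float   # model estimate of the cardinality
    pred_cost: float   # model estimate of the cost
    children: List["PlanNode"] = field(default_factory=list)


def iter_subplans(node: PlanNode) -> Iterator[PlanNode]:
    """Yield the root of every sub-plan in pre-order."""
    yield node
    for child in node.children:
        yield from iter_subplans(child)


def joint_log_loss(root: PlanNode, cost_weight: float = 1.0) -> float:
    """Average, over all sub-plans, of squared log-errors in cardinality and cost.

    Assumed form for illustration only: every operator contributes a label,
    so no sub-plan supervision is discarded.
    """
    total = 0.0
    count = 0
    for n in iter_subplans(root):
        card_err = math.log1p(n.pred_card) - math.log1p(n.true_card)
        cost_err = math.log1p(n.pred_cost) - math.log1p(n.true_cost)
        total += card_err ** 2 + cost_weight * cost_err ** 2
        count += 1
    return total / count


# A three-operator plan whose predictions match the labels exactly.
scan_a = PlanNode("Seq Scan", 1000, 12.0, 1000, 12.0)
scan_b = PlanNode("Index Scan", 50, 4.0, 50, 4.0)
join = PlanNode("Hash Join", 200, 30.0, 200, 30.0, [scan_a, scan_b])
loss = joint_log_loss(join)
print(loss)
```

Working in log space keeps operators with very large cardinalities from dominating the average, and `cost_weight` is one simple way to balance the two targets; both choices are assumptions of this sketch.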

Keywords

cardinality / cost / machine learning / AI4DB

Cite this article

Tao JI, Haoyang LI, Kai ZHONG, Jing ZHANG, Cuiping LI, Hong CHEN. OCACO: an operator-level cardinality and cost joint estimator. Front. Comput. Sci., 2026, 20(9): 2009349. DOI: 10.1007/s11704-025-50348-6

RIGHTS & PERMISSIONS

Higher Education Press
