Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China
zhewei@ruc.edu.cn
Abstract
Estimating the Number of Distinct Values (NDVs) is a critical task in the fields of databases and data streams. Over time, various algorithms for estimating NDVs have been developed, each tailored to different requirements for time, I/O, and accuracy. These algorithms can be broadly categorized into two main types: sampling-based and sketch-based. Sampling-based NDV algorithms improve efficiency by sampling rather than accessing all items, often at the cost of reduced accuracy. In contrast, sketch-based NDV algorithms maintain a compact sketch using hashing to scan the entire dataset, typically offering higher accuracy but at the expense of increased I/O costs. When dealing with large-scale data, scanning the entire table may become infeasible. Thus, the challenge of efficiently and accurately estimating NDVs has persisted for decades. This paper provides a comprehensive review of the fundamental concepts, key techniques, and a comparative analysis of various NDV estimation algorithms. We first briefly examine traditional estimators in chronological order, followed by an in-depth discussion of the newer estimators developed over the past decade, highlighting the specific scenarios in which they are applicable. Furthermore, we illustrate how NDV estimation algorithms have been adapted to address the complexities of modern real-world data environments effectively. Despite significant progress in NDV estimation research, challenges remain in terms of theoretical scalability and practical application. This paper also explores potential future directions, including block sampling NDV estimation, learning-based NDV estimation, and their implications for database applications.
1 Introduction
The challenge of estimating the number of distinct values (NDV) has a long history, beginning in 1943 with its application in biology to estimate the number of different species in nature [1]. This early work laid the foundation for NDV estimation in statistics, where researchers aimed to determine the support set size for discrete data distributions [2]. For example, NDV estimation has been used to gauge a writer’s vocabulary by counting distinct words in their corpus [3].
In database systems, accurate NDV estimation is critical for tasks such as query optimization [4–7]. Systems like PostgreSQL and Spark rely on precise NDV assessments to improve query cardinality estimation, leading to more efficient query execution. For instance, improvements in NDV accuracy can reduce query execution times by up to 50%, as demonstrated in [8], while inaccurate estimates can cause significant slowdowns [6].
Beyond databases, NDV estimation has also proven essential in fields like network security, where counting distinct IP addresses helps in detecting potential attacks [9–11]. In machine learning, a related challenge involves studying rare occurrences in training data, which approximates the probability of generating hallucinations in language models [12].
Due to the extensive application of NDV estimation, several surveys have been conducted across various domains [13,14]. However, these surveys do not fully reflect the trends in the latest research [13] and are limited to specific types of estimators. Harmouch et al. [14] thoroughly investigate sketch-based NDV estimation but overlook sampling-based approaches. While sketch-based methods are well-established in practical applications, particularly in databases, the growing complexity of modern database environments has increased the relevance of sampling-based methods. These advancements have led to the development of new NDV estimators that have not been comprehensively reviewed.
Previous surveys have primarily focused on accuracy as the sole criterion for model evaluation. However, different scenarios demand a broader range of metrics. In streaming models, where algorithms must process data in a single pass, processing speed and memory usage are critical. In distributed models, in addition to time efficiency, I/O and communication costs are also crucial considerations.
We present a comparison of the different estimations produced by these two categories in Fig.1. We are able to further subdivide the NDV estimation according to the design characteristics and analyze it in depth in the following sections.
By surveying recent algorithms, this paper delves into the core concepts of NDV estimation and sheds light on the insights behind various estimators. The main contributions are summarized as follows:
● We comprehensively review the basic concepts and characteristics of NDV estimation and classify NDV estimation into two categories, sampling-based and sketch-based, based on the characteristics of the techniques. We systematically investigate the historical developments and key techniques.
● We present in detail the sampling-based and sketch-based algorithms and analyze in depth their advantages and disadvantages.
● We discuss the new problems and challenges that have arisen with NDV estimation in recent years as the database environment has changed, as well as the new algorithms and techniques that have arisen for NDV estimation to cope with the changing environments.
The rest of the paper is organized as follows. Section 2 provides the frequently used notations and background knowledge about NDV estimation. Next, we introduce the estimation of NDV from sampling-based and sketch-based approaches in Section 3 and Section 4, respectively. Section 5 focuses on the adjustments and adaptations made to NDV estimation for the different data and environments encountered in real-world settings. Section 6 surveys cardinality estimation, the most important application of NDV estimation. We then discuss possible future directions and challenges in Section 7 and conclude in Section 8.
2 Notation and background
This section provides a detailed definition of the problem associated with NDV. The frequently used notations are summarized in Tab.1. In Fig.1, we categorize NDV estimation methods into two types: sampling-based and sketch-based. We outline the key developments in each of these approaches. Before providing a detailed description of each estimator, we first introduce the necessary background. Initially, we introduce the fundamental notations and problem definitions. Subsequently, we discuss the techniques frequently employed in sampling-based and sketch-based methods to avoid redundant explanations. Finally, we review the evaluation criteria relevant to this problem, which differ across fields and have driven continuous updates to the research in each of them.
2.1 Problem definition
In the context of database or data stream challenges, researchers are often faced with a collection of data. This data may manifest as a column within a database table or a series of IP addresses. Here, we define the concept of Frequency Vector.
Definition 1 (Frequency Vector / histogram). We consider the multi-set $\mathcal{I}$ over $D$ distinct elements $e_1, \dots, e_D$, where $D$ denotes the number of distinct elements. Let $N_j$ denote the number of times element $e_j$ appears; then $N = \sum_{j=1}^{D} N_j$ is the population size of the input data. For simplicity, we describe the source data by its frequency vector $(N_1, \dots, N_D)$. When we take a sample $\mathcal{S}$, where $n$ is the size of the sample and each item is uniformly sampled from the input $\mathcal{I}$, we denote by $n_j$ the number of times $e_j$ appears in the sample; then $\sum_{j=1}^{D} n_j = n$, and $d$ denotes the number of distinct values in the sample.
In order to compute arbitrary frequency moments of source data, we use the frequency of frequency, also known as profile and histogram of frequency. Its definition is as follows:
Definition 2 (Frequency of frequency). The frequency of frequency in the raw data is represented by $F_i$, which is the number of elements that occur exactly $i$ times. For a given sampled subset, this quantity is denoted by $f_i$, with $f_i$ indicating the number of elements appearing exactly $i$ times in the sampled data. Notably, $f_0$ corresponds to the number of elements that are absent from the sample. Additionally, a frequency ratio can be defined for the elements that occur $i$ times. Note that frequency of frequency is also referred to as histograms of histograms [15], histogram order statistic [16], fingerprints [3] and profiles [17].
Example 1 (Frequency of frequency). Given input data $\mathcal{I} = \{a, a, a, b, b, c\}$, the frequency vector of $\mathcal{I}$ is $(3, 2, 1)$ and the frequency of frequency of $\mathcal{I}$ is $F_1 = F_2 = F_3 = 1$. As a result, we have $N = 1 \cdot F_1 + 2 \cdot F_2 + 3 \cdot F_3 = 6$. Using the frequency of frequency of $\mathcal{I}$, the number of distinct values of $\mathcal{I}$ can be further calculated as $D = F_1 + F_2 + F_3 = 3$. If we sample from $\mathcal{I}$ obtaining the subset $\mathcal{S} = \{a, a, b\}$, then the frequency of frequency for the sampled subset is $f_1 = 1$, $f_2 = 1$ (and $f_0 = 1$, since $c$ is unseen).
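These quantities are easy to compute directly; the following Python sketch (the helper name frequency_profile is ours) reproduces the numbers of Example 1.

```python
from collections import Counter

def frequency_profile(data):
    """Compute the frequency vector and the frequency of frequency of a dataset.

    Returns (freq_vector, freq_of_freq), where freq_vector maps each distinct
    value to its count N_j, and freq_of_freq maps each count i to F_i, the
    number of distinct values appearing exactly i times.
    """
    freq_vector = Counter(data)                   # value -> N_j
    freq_of_freq = Counter(freq_vector.values())  # i -> F_i
    return freq_vector, freq_of_freq

# Reproducing Example 1: D = sum_i F_i and N = sum_i i * F_i.
freq_vector, F = frequency_profile(["a", "a", "a", "b", "b", "c"])
D = sum(F.values())                    # 3 distinct values
N = sum(i * c for i, c in F.items())   # population size 6
```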
The number of distinct values can be calculated from the frequency of frequency. This class of polynomial expressions for the frequency of frequency can be defined as frequency moments.
Definition 3 ($r$th order frequency moments). Consider the multi-set $\mathcal{I}$ with the frequency of frequency $F_1, F_2, \dots$, where $F_i$ denotes the number of elements that appear exactly $i$ times. The $r$th order frequency moments for each $r \ge 0$ are
$\mathcal{F}_r = \sum_{i \ge 1} i^r F_i = \sum_{j=1}^{D} N_j^r.$
The number of distinct values of $\mathcal{I}$ is the zeroth frequency moment $\mathcal{F}_0 = \sum_{i \ge 1} F_i$, which is the number of elements appearing at least once. We use $D$ for $\mathcal{F}_0$ throughout the remainder of the paper. The population size is exactly $\mathcal{F}_1 = N$.
2.2 Common technologies
This subsection describes the common technologies applied in sampling-based and sketch-based NDV estimation. First, we introduce the concepts and techniques shared by these stochastic estimation algorithms. Then, we introduce the core techniques that are commonly used in sampling-based and sketch-based methods, respectively.
Definition 4 (($\epsilon, \delta$)-approximation scheme). An ($\epsilon, \delta$)-approximation scheme for a quantity $Q$ is a (randomized) procedure that, given any positive $\epsilon$ and $\delta$, computes an estimator $\hat{Q}$ such that, with probability at least $1 - \delta$, the estimator is within a relative error $\epsilon$. Formally, this means that $\Pr\left[|\hat{Q} - Q| \le \epsilon Q\right] \ge 1 - \delta$.
$\epsilon$ and $\delta$ are the parameters that control the approximation and the randomness, respectively. When there is a need to increase the probability of success, estimators or sketches often employ the median trick.
Definition 5 (Median trick). Let $\mathcal{A}$ be a randomized algorithm that produces an estimator within an error margin of $\epsilon$ of the true value with probability strictly greater than $1/2$. By running $\mathcal{A}$ independently $O(\log(1/\delta))$ times and taking the median of the resulting estimators, the error margin $\epsilon$ holds with probability at least $1 - \delta$.
The median trick is widely used in randomized algorithms to improve the probability of obtaining good estimates.
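A minimal sketch of the median trick; the function name and the toy noisy estimator below are ours, used only to illustrate how the median suppresses occasional bad runs.

```python
import random
import statistics

def median_trick(run_estimator, k):
    """Boost the success probability of a randomized estimator by running it
    k times independently and returning the median of the k estimates.
    If a single run lands within the error bound with probability > 1/2, the
    median fails only when more than half of the runs fail, which happens
    with probability exponentially small in k (Chernoff bound)."""
    return statistics.median(run_estimator() for _ in range(k))

# Toy usage: a noisy estimator of the value 1000 with heavy-tailed noise.
noisy = lambda: 1000 * random.choice([0.2, 0.95, 1.0, 1.05, 5.0])
print(median_trick(noisy, k=9))
```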
Block sampling Sampling-based NDV estimation utilizes samples taken uniformly from the source data. Traditional sampling-based algorithms take each item independently with equal probability $n/N$. However, Brutlag et al. [18] point out that, in databases, data is usually stored in blocks on the hard disk. To achieve uniform row-level sampling in a database, the process essentially requires scanning the entire table, which is no different from a full table scan in terms of I/O. While such scanning for sampling purposes can improve efficiency to some extent, it does not reduce the time complexity to sub-linear levels; the complexity remains linear, as each data point needs to be considered to ensure an even and representative sample. Therefore, to realize sampling-based estimation of NDV, it is necessary to sample data blocks uniformly. Uniformly sampling blocks of data can lead to data dependence in the samples, and there has been some work in recent years on how to use block sampling to estimate the NDV [18,19]. Ensuring that block sampling achieves both efficiency and accuracy simultaneously remains an unresolved challenge.
Just as sampling is central to sampling-based methods, the core technology of sketch-based methods is hashing.
Hashing Sketch-based methods typically feed data into hash functions to extract information about the data. Hash functions are randomly picked from a family $\mathcal{H}$, whose members map elements to the range $[m] = \{0, 1, \dots, m-1\}$. The independence of a hash function directly influences the number of collisions that occur between data items. We define $k$-wise independent functions as follows.
Definition 6 ($k$-wise independent function). For every $k$ distinct items $x_1, \dots, x_k$ and every output $y_1, \dots, y_k \in [m]$,
$\Pr\left[h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k\right] = \frac{1}{m^k},$
where $h$ is chosen uniformly at random from $\mathcal{H}$.
Cryptographic hash functions, such as MD5, SHA-1, and SHA-3, are utilized when a highly independent hash function is required. However, in real applications, hash functions built from fast operations, such as exclusive-or and bit shifts, are more popular. A popular example in practice is MurmurHash, which is implemented by a series of fast bit-manipulation operations and is very effective in applications.
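As an illustration of pairwise independence, the following is the textbook 2-wise independent family built from a random affine map modulo a prime; it is a didactic sketch of ours, not the hash used by any particular system.

```python
import random

# A classical 2-wise independent family: h(x) = ((a*x + b) mod p) mod m,
# with p prime and a, b drawn uniformly (a != 0). Production systems usually
# prefer fast non-cryptographic hashes such as MurmurHash instead.
P = (1 << 61) - 1  # a Mersenne prime larger than the key universe

def random_pairwise_hash(m):
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

h = random_pairwise_hash(m=1024)
print(h(42), h(43))  # two keys, hashed independently in the pairwise sense
```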
2.3 Evaluation criteria
The criteria for evaluating estimation error vary across problems because application requirements differ. We describe the different evaluation criteria as follows. First, we define the relative error:
Definition 7 (Relative error). The relative error of the estimation $\hat{D}$ is:
$\mathrm{error}_{\mathrm{rel}} = \frac{|\hat{D} - D|}{D}.$
Some of the sketch-based estimators, such as Count-min sketch [20] and Bloom Filter [21], do not guarantee the relative error because they rely on estimators constructed from order statistics. Instead, they provide the additive error as follows:
Definition 8 (Additive error). Given an estimation $\hat{D}$, the additive error of $\hat{D}$ is:
$\mathrm{error}_{\mathrm{add}} = |\hat{D} - D|.$
The additive error measures accuracy without reference to the magnitude of the true value, offering a straightforward absolute metric of the estimation error.
In query optimization and cardinality estimation, it is common to use the Q-error [22] and the ratio error [23] as evaluation criteria, defined as:
Definition 9 (Q-error/ratio error). The Q-error/ratio error of the estimation $\hat{D}$ is:
$\mathrm{error}_{Q} = \max\left(\frac{\hat{D}}{D}, \frac{D}{\hat{D}}\right).$
In statistical research, relative error is commonly used to assess the quality of an estimator. However, in the context of database research, factors such as time and space efficiency, along with their impact on data, are often more critical. As a result, metrics like Ratio error and Q-error are more frequently employed for database tasks. Additionally, when dealing with scenarios where the NDV is relatively small, additive error provides a more reliable error guarantee, helping to avoid extreme cases.
3 Sampling-based NDV estimation
This section reviews the development of the sampling-based estimators. We mainly categorize the sampling-based estimators into traditional and optimization-based estimators. Traditional estimators mainly refer to methods, largely proposed before 2000, that answer the NDV estimation problem by designing a fixed, concise estimator. Optimization-based estimators, on the other hand, typically use optimization methods to adapt to different data inputs.
3.1 Traditional estimator
We illustrate the evolution of the traditional estimators through the introduction of four types of estimators.
Expectation-based classical estimators NDV estimation originates in 1943 [1], and the first NDV estimator is designed by Goodman [24] as follows:
which is back-calculated from the expectation of . is the unique unbiased estimator, when the sample size is larger than . When , there is no unbiased estimator. In the earliest estimators, the practice of estimating frequent and infrequent elements separately was already considered. The distinction between frequent and infrequent elements has become a crucial approach in sampling-based NDV estimation.
Since unbiased estimators fail to exist only when the sampling rate is too small and the coefficients of the estimator do not converge, Good and Toulmin [25] consider a smoothed estimator obtained by notionally increasing the number of samples, as follows:
where the adjustable parameter controls the sampling rate; when the number of samples is insufficient, it notionally increases the sample size.
Efron and Thisted [2] utilize Euler's transformation [26] to force the oscillating series to converge rapidly and propose a new estimator of the form
where the tuning parameter is essentially an adjustable threshold that separates high and low frequencies.
Shlosser [27] also leverages the assumption that , which holds when the population size is large and the sampling fraction is nonnegligible. Shlosser [27] derives the estimator:
According to [13], the estimators of [24] and [27] are the only two non-Bayesian estimators derived directly from the sampling probability. They are similar in that the estimation error can be large. According to [28], although Shlosser found that a 10% sampling fraction resulted in an error rate below 20% in the original experiments, the root mean squared errors still exceeded 200%. The effectiveness of the Shlosser estimator is closely related to the data distribution: it performs better on highly skewed data, which more closely satisfies its underlying assumptions.
Bootstrap and jackknife estimator The main problem faced by NDV estimation is undersampling. To address this issue, a classical method named the jackknife was proposed in 1949 [29]. The classic jackknife estimator constructs a new estimator by systematically omitting one observation at a time from the sample, producing subsamples, and then aggregating the estimates derived from these subsamples. For estimating the number of distinct species in a population, Burnham and Overton [30,31], Heltshe and Forrester [32] and Smith and van Belle [33] develop jackknife schemes. These jackknife estimators have a series of variants.
Another classic method to address undersampling is the bootstrap, which is proposed by Bradley Efron in [34]. Bootstrapping resamples from the samples or the distribution of the samples to increase the number of samples available for estimator design. Smith and van Belle [33] propose the bootstrap estimator as follows:
The bootstrap estimator performs poorly when the sample covers only a small fraction of a large population. In [28], Haas and Stokes propose a generalized jackknife algorithm and discover the connections between the generalized jackknife approach and several other estimation approaches. Recall that $d$ is the number of distinct values in the sample. Haas et al. [28] derive their estimators through the generalized jackknife construction as follows:
where denote the number of distinct values from samples to emphasize the dependence and is an unknown real number. According to [28], the optimal value of is:
Following the settings of [27], if the population size is large and the sampling fraction is nonnegligible, we have the below approximation:
If we assume that and , the estimator is equivalent to , revealing the underlying connection between the estimators.
Haas et al. [28] discuss a number of estimators that can be obtained based on different approximations for the unknown quantities. By approximating each term in Eq. (8) appropriately and using Eqs. (6) and (7), they obtain the expression:
By solving Eq. (9), they first derive the unsmoothed first-order jackknife estimator:
By further approximating the relevant term in Eq. (9), they arrive at the relation,
whose solution can be viewed as satisfying Eq. (11). Haas et al. [28] borrow this design and derive the unsmoothed second-order jackknife estimator:
where is an estimator of coefficient of variation (CV), , which is defined as:
and the smoothed second-order jackknife estimator,
where . and are consistent for .
Haas et al. [6] also propose a more complex version of the smoothed jackknife estimator
where can be calculated by the standard gamma function :
for , and can be given by Eq. (12) and is defined by
and is defined by
and .
Another frequently mentioned jackknife estimator is the Horvitz-Thompson estimator [35,36]. Based on the jackknife construction technique, the Shlosser estimator also has a corresponding iterative improvement scheme [28]. We can refer to [6,28] for the different versions of Horvitz-Thompson and Shlosser estimators for NDV estimation and their derivation process. According to the experiments of [28], the introduction of jackknife compensates for the errors due to under-sampling. However, since these jackknife estimators have complex expressions, they are not used much in practice. Despite its shortcomings, the jackknife estimator has greatly inspired subsequent ideas for integrating estimators for different data distributions.
Chao and Chao-Lee’s estimator Although the origins of NDV research lie in counting the number of unseen species, its intensive study is closely linked to the development of databases. The first estimator to be applied to databases is Chao’s estimator [5].
Chao [37] first considers the relationship between the frequency moments. By an asymptotic argument in which the number of classes tends to infinity, Chao obtains the estimator:
$\hat{D}_{\mathrm{Chao}} = d + \frac{f_1^2}{2 f_2}.$
When $f_2 = 0$, $\hat{D}_{\mathrm{Chao}}$ is ill-defined. According to [38], a bias-corrected version of Chao’s estimator is as follows:
$\hat{D}_{\mathrm{Chao1}} = d + \frac{f_1(f_1 - 1)}{2(f_2 + 1)}.$
Due to its elegant design and strong empirical performance, Chao’s estimator can be easily embedded into programs. The first such attempt is by Ozsoyoglu et al. [5], who integrate the estimator into SQL statements.
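A minimal sketch of Chao's estimator and its bias-corrected variant, assuming the forms commonly reported in the literature; the function name and dictionary interface are ours.

```python
def chao_estimator(f, d):
    """Chao's estimator as commonly stated: D_hat = d + f1^2 / (2 * f2),
    falling back to the bias-corrected form d + f1*(f1 - 1) / (2*(f2 + 1))
    when f2 is zero. Here f is the sample frequency of frequency {i: f_i}
    and d is the number of distinct values observed in the sample."""
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if f2 > 0:
        return d + f1 * f1 / (2.0 * f2)
    return d + f1 * (f1 - 1) / (2.0 * (f2 + 1))
```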
After Chao’s estimator, Chao and Lee [39] design new estimators to cope with data from infinite distributions through the perspective of sample coverage. Sample coverage is defined as the fraction of classes in the population that appear in the sample:
According to the Good–Turing estimate [40], $\hat{C} = 1 - f_1/n$ is used for the sample coverage. To deal with the skew in the data, Chao and Lee [39] combine this coverage estimator with a correction term and obtain the estimator:
where is also the estimator of coefficient of variation (CV) like Eq. (13), which is defined as:
where the initial estimator is the coverage-based estimate $d/\hat{C}$. With different bias-corrected versions of the CV estimator, Chao et al. [39] propose more complex versions of the estimator. The coefficient of variation is essentially a parameter that characterizes the distribution of the data. Incorporating such a parameter into the estimator enables it to adapt to variations across different datasets more effectively. Though this family of estimators is not as widely used as Chao’s estimator, the introduction of the CV term greatly influences the design of subsequent complex classical estimators, which we introduce next.
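A sketch of the coverage-plus-CV construction, assuming the commonly reported form of the Chao–Lee estimator; the function name and the degenerate-coverage fallback are ours, not part of the original estimator.

```python
def chao_lee_estimator(f, d, n):
    """Chao-Lee coverage-based estimator as commonly stated:
    C_hat = 1 - f1/n is the Good-Turing sample coverage, gamma^2 estimates
    the squared coefficient of variation, and
    D_hat = d / C_hat + n * (1 - C_hat) / C_hat * gamma^2.
    f: sample frequency of frequency {i: f_i}, d: sample NDV, n: sample size."""
    f1 = f.get(1, 0)
    c_hat = 1.0 - f1 / n
    if c_hat == 0:                      # degenerate case: every item seen once
        return float(d) * n             # crude fallback (our choice)
    d0 = d / c_hat                      # initial coverage-based estimate
    gamma2 = max(d0 * sum(i * (i - 1) * fi for i, fi in f.items())
                 / (n * (n - 1)) - 1.0, 0.0)
    return d0 + n * (1.0 - c_hat) / c_hat * gamma2
```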
Integrated estimator The NDV estimation process is characterized by two primary features: the distinction between high and low-frequency elements and the differentiation between uniform and skewed distributions. These two aspects must be carefully integrated into the design of an effective estimator.
A significant trend in NDV estimation research is the distinction between high and low frequencies of occurrence. For instance, in the estimator, the term aggregates low-frequency information, while the parameter consolidates high-frequency data. Additionally, it is crucial to differentiate between uniform and skewed distributions. In the case of Jackknife estimators, the derivation necessitates replacing with under the assumption that the data is evenly distributed. Conversely, the estimator requires the condition , which assumes a skewed distribution. As a result, these estimators are optimized for the specific types of distributions for which they are designed. The challenge in estimator design lies in effectively balancing the need to account for both frequency variations and distributional assumptions, ensuring accurate and robust estimation.
Charikar et al. [23] focus on observing extremes in the data distribution and using targeted estimators for different data distributions. Charikar et al. [23] also propose a hard case that requires distinguishing the following two situations:
● $N$ identical elements.
● $N - D$ identical elements together with $D$ distinct elements.
In the second scenario, an estimator is accurate only when the distinct elements are all sampled. Based on this hard case, for any sampling-based estimator, with probability at least $1 - \delta$, the ratio error has the following lower bound:
Charikar et al. [23] also propose a concise but efficient estimator, the Guaranteed-Error Estimator (GEE), for this extreme case as follows:
$\hat{D}_{\mathrm{GEE}} = \sqrt{\frac{N}{n}} \, f_1 + \sum_{i \ge 2} f_i.$
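A minimal sketch of the GEE estimator under the form stated above; the function name and the frequency-of-frequency dictionary interface are ours.

```python
import math

def gee_estimator(f, N, n):
    """GEE as commonly stated: D_hat = sqrt(N / n) * f1 + sum_{i >= 2} f_i.
    f: sample frequency of frequency {i: f_i}, N: population size, n: sample size."""
    f1 = f.get(1, 0)
    rest = sum(fi for i, fi in f.items() if i >= 2)
    return math.sqrt(N / n) * f1 + rest
```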
Such extreme cases in datasets are rare in real-world applications, and for tasks like query optimization, exact NDV precision is often unnecessary. Minor inaccuracies typically have little impact on performance. In order to solve the case of different distributions, the first hybrid estimator is proposed by [6] as follows:
where the cutoff is a percentile of a Chi-square distributed random variable with the appropriate degrees of freedom, for which [6] sets a specific value.
Charikar et al. [23] first replace Shlosser’s estimator for the skewed part with GEE to obtain a hybrid estimator. Considering that this hybrid cannot provide a theoretical guarantee like GEE for low-skew data, Charikar et al. [23] derive an adaptive estimator:
where is the solution to the following equation:
Higher-order expansions can yield more complex adaptive estimators, but, as with the estimator, this often leads to larger errors, making such complexity generally unnecessary. The process of solving for also sets the tone for subsequent estimator development. In practical NDV estimation, the estimator’s ability to adapt to varying data distributions becomes crucial. A fixed estimator design often struggles to accommodate the diversity in input data effectively.
After 2000, there were fewer designs for traditional fixed-form estimators. This shift occurred because the worst-case scenarios presented in [23] posed significant challenges that are difficult to address using sampling-based algorithms. Consequently, researchers began to focus on developing estimators tailored to specific data distributions, as seen in studies such as [41–43], to avoid the limitations faced by traditional estimators.
3.2 Optimization-based estimator
In this subsection, we introduce sampling-based NDV estimation algorithms that utilize optimization techniques to adapt to various data inputs. These algorithms focus on determining the optimal number of samples needed to meet accuracy requirements. However, as noted in [23], there exist extreme cases where sampling-based methods may struggle to perform effectively. Nevertheless, modern algorithms prioritize real-world data scenarios over rare, challenging cases, focusing instead on exploring optimal sampling strategies.
Linear programming Solving NDV estimation with linear programming is a classic approach. As far back as Goodman’s [24] original research, the relationship between the expected sample frequencies of frequency and the population histogram was identified. In the past, constrained by limited computational resources, estimators had to be designed with simplicity in mind to minimize the number of computations required. However, with the advancement of computing power and the improved ability to solve linear programming problems, it has become feasible to solve NDV estimation using dynamic optimization techniques. Valiant et al. [3] first use multivariate central limit theorems (CLTs) and Stein’s method to prove that obtaining the support size within a constant additive error requires a sample size on the order of $k/\log k$, where $k$ is the upper bound of the support size. Studying NDV estimation from a statistical point of view requires setting a minimum probability with which an element can be sampled. When the data is confined to a finite dataset, the support size estimation problem can be viewed as equivalent to the NDV estimation problem; in this context, the upper bound of the support size, $k$, should be set equal to the total number of elements in the dataset. Valiant and Valiant [3,44] begin with the relationship between the observed frequency of frequency and the underlying histogram.
The core idea is first to generate a predetermined mesh of probability points. Notice that if the mesh covers the original probability values, then we can derive the relationship below:
where the coefficient is the probability, under the Poissonization technique, that an element occurring with a given probability is drawn exactly $i$ times in $n$ independent draws. Using this probabilistic relationship between the histogram and the expected frequency of frequency, along with the predetermined mesh points and the observed input, we can formulate an optimization problem to estimate the histogram values. By solving this optimization problem, we can derive approximations for various properties [45], including NDV and entropy. The problem can be mathematically defined as follows: given the frequency of frequency $f_1, f_2, \dots$ of a sample with sample size $n$, define the vector $X$ consisting of a fine mesh of points in the interval $(0, 1]$. The objective is to solve the following linear program with variables $h_x$ for $x \in X$:
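The following is a heavily simplified sketch of this linear-programming idea, assuming NumPy and SciPy are available; the mesh construction, the rare/frequent cutoff, and the weighting are illustrative choices of ours and do not reproduce the exact program of [3,44].

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import poisson

def lp_unseen_ndv(f, n, mesh_size=100):
    """Simplified linear-programming sketch for unseen-NDV estimation.
    f: dict {i: f_i} of sample frequencies of frequency, n: sample size."""
    # Split into "rare" frequencies explained by the LP and "frequent" ones
    # whose distinct values are counted directly.
    cutoff = max(10, int(np.sqrt(n)))
    rare = {i: c for i, c in f.items() if i <= cutoff}
    d_frequent = sum(c for i, c in f.items() if i > cutoff)
    if not rare:
        return float(d_frequent)

    # Mesh of candidate probabilities for the rare/unseen part.
    xs = np.linspace(1.0 / (mesh_size * n), cutoff / n, mesh_size)
    freqs = sorted(rare)
    m, k = len(xs), len(freqs)

    # Variables: histogram values h_j plus one slack t_i per rare frequency,
    # used to linearize |f_i - E[f_i]| in the objective.
    c = np.concatenate([np.zeros(m),
                        1.0 / np.sqrt(1.0 + np.array([rare[i] for i in freqs]))])

    # Expected fingerprints under Poissonization: E[f_i] = sum_j h_j * Poi(n*x_j; i).
    P = np.array([[poisson.pmf(i, n * x) for x in xs] for i in freqs])  # k x m

    # |f_i - (P h)_i| <= t_i  <=>  P h - t <= f and  -P h - t <= -f.
    A_ub = np.block([[P, -np.eye(k)], [-P, -np.eye(k)]])
    b_ub = np.concatenate([[rare[i] for i in freqs],
                           [-rare[i] for i in freqs]])

    # The rare part of the histogram must carry the observed rare probability mass.
    A_eq = np.concatenate([xs, np.zeros(k)])[None, :]
    b_eq = [sum(i * rare[i] for i in freqs) / n]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (m + k), method="highs")
    return float(res.x[:m].sum()) + d_frequent
```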
Valiant et al. conduct a series of studies [3,44,46–48] based on this linear program from 2011 to 2017. This line of work fully discusses how to use linear programming to estimate support-related problems by adjusting the mesh to generate different programs [46], designing the objective function [44,48] and exploring application scenarios [44,49]. The linear programming algorithms offer the advantage of adapting to different datasets and providing a theoretically sound approach for solving NDV estimation. However, they face several drawbacks, including inefficiency in solving the linear programs and instability in the estimated results, which make them difficult to apply in practice. Consequently, subsequent methods have built upon the concept of transforming NDV estimation into an optimization problem while avoiding the computational complexity associated with solving high-dimensional linear programs.
Profile maximum likelihood (PML) Another powerful tool for optimization is the maximum likelihood principle. Motivated by the principle of maximum likelihood, Orlitsky et al. [17] try to maximize the probability of the observed profile for the Profile Maximum Likelihood (PML) distribution. PML is slightly different from the previous maximum likelihood methods, which is defined as sequence maximum likelihood according to [50]. Their difference centers on the definition of the basic distribution, which can be shown clearly by the example in [50].
Example 2 ([50]). Suppose a sample is observed; the SML distribution is then the empirical distribution of that sample. However, the distribution that maximizes the probability of the profile can be different. Another example from [50] illustrates the power of PML to predict new symbols: for the corresponding profile, the SML distribution is supported only on the observed symbols, but PML is a uniform distribution over five elements.
The objective of the profile maximum likelihood estimator is summarized in [51] to the following optimization problem. Given samples with empirical distribution , maximize the probability to observe up to relabeling of the empirical distribution, where is some permutation. This amounts to computing the PML distribution :
where the argmax is over all discrete distributions and the sum is over all permutations of distribution , and is the Kullback-Leibler divergence.
According to [50], in comparison to the linear programming estimator, PML-based methods, by virtue of the maximum likelihood principle, are capable of identifying the distribution that best explains the entire profile of the observed data. Additionally, PML-based methods reduce the optimal sampling complexity, as outlined by Valiant et al. [3].
The experiments in [51] show that PML can better fit the distribution of data than linear programming. However, the profile maximum likelihood estimator is computationally challenging to find, and the best-known algorithm requires time exponential in the support size. Therefore, efficient approximate computation algorithms have become the focus of research on PML-based methods. Acharya et al. [52] try to use algebraic computation for PML. [53,54] combine the EM and MCMC algorithms for better results on the PML distribution. Vontobel et al. [55,56] utilize the Bethe approximation to approximate PML. Pavlichin et al. [51] propose an efficient algorithm for approximate computation of the profile maximum likelihood (APML). Despite the relatively good theoretical guarantees of the PML approach, in practice it is rarely applied to database and data streaming tasks due to its high computational costs.
Polynomial approximation The idea of utilizing polynomials for NDV estimation originated from [57]. The linear programming and APML techniques used for solving NDV estimation and entropy estimation can inform and enhance each other. Wu et al. [58] propose the optimal estimation of the unseen based on Chebyshev polynomials. Whereas the former methods characterize, within universal constant factors, the minimal sample size needed to achieve a small additive error, Wu et al. [58] improve the state-of-the-art sample complexity. Wu et al. [58] begin with the linear estimator:
where is a function to be determined with respect to . As early as [24], we know that there is a linear correlation between NDV and low-frequency of frequency. By defining as:
where can be obtained from the Chebyshev polynomial and the binomial expansion and , is a preset constant. The core idea of the polynomial approximation remains to approximate the low-frequency elements. Chebyshev polynomial approximation [58] provides a theoretical partition, , for distinguishing between high and low-frequency elements. Subsequent research has also focused on improving the optimization and approximation intervals like PML. Chien [59] goes a step further and proves that polynomial NDV estimation is a weighted polynomial approximation problem. Since estimating the NDV is a weighted polynomial approximation problem, the Chebyshev polynomial approximation is only optimal for the uniform case and cannot cope with all the data distributions.
The sampling-based polynomial NDV estimation method highlights the essential elements (infrequent elements) that need to be computed in the NDV estimation. It provides a deep understanding of the fundamental source of the required sample size for NDV estimation. Additionally, this method is advantageous due to its low computational complexity.
Machine learning-based estimator Building on the foundation of polynomial NDV estimation, recent advancements have introduced machine learning-based approaches to enhance accuracy further. Eden et al. [60] developed a machine learning-based predictor to improve the precision of polynomial approximations. Given the inherent challenges in reducing the sampling complexity of NDV estimation, the integration of machine learning becomes essential for handling diverse data distributions. These learned methods are inherently tied to real-world data and applications, prioritizing effectiveness in practical scenarios over addressing worst-case situations.
In a virtualized network, Cohen et al. [10] first use online machine learning to predict the NDV of visitors to the network. According to Fig. 2 of [10], there is a strong correlation between the low-order frequencies of frequency and NDV; as a result, the more complex models of subsequent work continue to be built on the low frequencies of frequency as well.
Wu et al. [61] propose an effective machine-learning-based estimator for NDV estimation in database settings. They address three fundamental issues for learned estimators: training data, model structure and generalization ability. Wu et al. [61] generate training data that covers a variety of distributions. Having found a linear correlation between the low frequencies of frequency and NDV, Wu et al. [61] use the truncated frequency-of-frequency profile and extra parameters as inputs and a multi-layer linear neural network to learn a regression predictor. They also design a loss function that enhances the generalization ability of the model from a theoretical point of view. Wu et al. [61] further show experimentally that the learning-based method performs well on real-world data distributions. Finally, to evaluate the usability of the learned estimator, the predictor is deployed in a data warehouse with user-defined functions (UDFs).
Recently, Li et al. [45] combine the principles of polynomial estimators with machine learning by treating the weight component as a trainable neural network. This approach leads to a more effective estimator that not only capitalizes on the theoretically optimal sampling complexity of polynomial methods but also adapts to varying data distributions. Additionally, this method has been extended to tackle more complex and diverse property estimation challenges, underscoring the potential for future advancements in NDV estimation techniques.
3.3 Summary
We summarize and compare the key features of sampling-based estimators in Tab.2. Since sampling-based estimators avoid accessing the overall data, they are more efficient when resources are constrained, but their accuracy cannot be guaranteed. Traditional estimators have very elegant expressions that are interpretable and suitable for static problems with known distributions. Overall, researchers have increasingly focused on improving the accuracy of sampling-based estimators. As a result, sampling-based estimators have evolved from static, fixed-expression designs to optimization-based approaches that accommodate the effects of various data distributions on the estimator. Optimization-based methods are better at incorporating newly generated data for more complex problems but are correspondingly less efficient, making the trade-off between accuracy and efficiency more important to consider.
4 Sketch-based NDV estimation
This section reviews the development of sketch-based NDV estimators. The sketch-based NDV estimators are roughly classified into two categories by [14]: uniform hashing and logarithmic hashing algorithms. The former mainly uses a uniform hash function to hash the entire dataset into an interval, and the latter mainly tracks the most uncommon element observed so far. Uniform hashing algorithms perform well on datasets with small NDV, and logarithmic hashing algorithms handle large-scale datasets better. Sketch-based algorithms utilize a hash function to merge different elements. They only need to access each element once, with a total time complexity of $O(N)$ and a small space complexity.
4.1 Uniform hashing
Uniform sketch-based NDV estimation algorithms map elements into a small interval with a uniform hash function and exploit the proportionality between the occupancy of the interval and the NDV for estimation.
Linear counting sketch Whang et al. [62] design Linear Counting (LC), which uses a bitmap of size $m$, initialized to all zeros, to record the bits mapped by the hash function $h$. LC visits each input element $x$ and hashes it, setting the corresponding bit to one, i.e., $\mathrm{bitmap}[h(x)] = 1$. LC extrapolates the NDV by counting the number of bits that are not mapped, denoted by $U_n$. The expected probability of a bit not being hit is approximately $e^{-D/m}$. Therefore, the maximum likelihood estimator of LC is:
$\hat{D}_{\mathrm{LC}} = -m \ln \frac{U_n}{m}.$
The accuracy of LC is controlled by the load factor $D/m$. Whang et al. [62] analyze the standard error of the estimate and show that an appropriately chosen load factor provides roughly 1% standard error.
We provide a straightforward example below.
Example 3 (LC). Given a random input sequence, suppose the bitmap size $m$ is set to 7 and, after hashing all elements, only index 4 has no mapped value. Then $U_n$ is 1 and $\hat{D}_{\mathrm{LC}} = -7 \ln(1/7) \approx 13.6$, while the ground-truth value is 10.
Although LC achieves high accuracy, its primary drawback is the requirement to have approximate knowledge of the range of the NDV and to preset the size of the bit array accordingly.
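A minimal Linear Counting sketch following the description above; the MD5-based toy hash and the saturation fallback are assumptions of ours, and a fast hash such as MurmurHash would be used in practice.

```python
import hashlib
import math

def _h(x, m):
    # Toy uniform hash into {0, ..., m-1} built on MD5.
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % m

def linear_counting(stream, m=1024):
    """Linear Counting: hash every element into a bitmap of size m and
    estimate D_hat = -m * ln(U_n / m), where U_n counts the empty bits."""
    bitmap = [0] * m
    for x in stream:
        bitmap[_h(x, m)] = 1
    u = bitmap.count(0)
    if u == 0:          # bitmap saturated: m was chosen too small for this D
        return float("inf")
    return -m * math.log(u / m)
```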
Bloom filter A Bloom filter (BF) is designed to recognize whether an element exists in a set or not. The main error of LC comes from collisions in the bitmap, and a Bloom filter can reduce collisions by using $k$ independent hash functions. Denote the total number of bits in the Bloom filter as $m$ and the number of set bits as $X$. Similar to how LC utilizes the relationship between the number of zero bits and the probability of being unmapped, Swamidass and Baldi [63] estimate the NDV using the estimator:
$\hat{D}_{\mathrm{BF}} = -\frac{m}{k} \ln\left(1 - \frac{X}{m}\right).$
Papapetrou et al. [64] derive a new estimator by the maximum likelihood as follows:
Besides, BF can be applied to estimate the cardinality of the intersections [64]. BF is a relatively easy algorithm to extend, and we will discuss its extensions in the following section. Similar to BF, many sketches designed for other purposes can be applied to NDV estimation, such as AMS [65], Count-Min [66] and Coordinated Sampling [67].
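A hedged sketch of the Swamidass–Baldi style estimate from a Bloom filter; the toy seeded hash, the helper names, and the parameter defaults are ours.

```python
import hashlib
import math

def _h(x, seed, m):
    # Toy seeded hash into {0, ..., m-1}; real systems use MurmurHash etc.
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % m

def build_bloom(stream, m=4096, k=3):
    """Insert every element into a Bloom filter with k hash functions."""
    bitmap = [0] * m
    for x in stream:
        for s in range(k):
            bitmap[_h(x, s, m)] = 1
    return bitmap

def bf_ndv_estimate(bitmap, k):
    """Estimate NDV from a Bloom filter: D_hat = -(m / k) * ln(1 - X / m),
    where m is the filter size and X the number of set bits."""
    m, x = len(bitmap), sum(bitmap)
    if x == m:
        return float("inf")   # filter saturated
    return -(m / k) * math.log(1.0 - x / m)
```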
BJKST sketch A series of methods try to provide strong guarantees in the form of an approximation scheme. The first ($\epsilon, \delta$)-approximation NDV estimation algorithm is proposed by [67], namely Coordinated Sampling. Bar-Yossef et al. [68] subsequently propose three improved algorithms on top of Coordinated Sampling, in conjunction with AMS [65], the third of which achieves near-optimal space and is called the BJKST algorithm, an acronym of the authors’ last names. BJKST maintains a buffer to store elements at a current level $z$. Referring to [14], with the insertion of an element $x$, the BJKST algorithm keeps track of a compressed identity of $x$ together with $\rho(h(x))$, where $h$ is a uniform pairwise independent hash function, the compression uses a pairwise independent universal hash function, and $\rho(\cdot)$ represents the position of the least significant 1-bit in the hash value. The hash function guarantees that this position takes a given value with precisely the corresponding geometric probability, as stated in [65]. In terms of hash functions, BJKST uses both uniform hashing and logarithmic hashing. When the buffer size grows larger than a preset threshold, set in [68] as a function of $\epsilon$, the level $z$ is increased by one and the elements in the buffer with a level less than the new $z$ are removed. A hash function is used for the identities instead of the actual names. After visiting the dataset, BJKST uses the buffer size and the level to construct the estimator:
BJKST provides an ($\epsilon, \delta$)-approximation scheme when the median trick is used over $O(\log(1/\delta))$ rounds. Compared to Linear Counting and the Bloom filter, BJKST removes some elements to achieve better space complexity. This means that spreading the data evenly over the hash range and retaining some data points on the edge is what matters most for NDV estimation, an idea used in the KMV algorithm described next.
KMV sketch As suggested by BJKST, order statistics of hash values can perform well for NDV estimation. A typical example is the K Minimum Values (KMV) estimator, which is also inspired by the algorithmic ideas in BJKST [68]. Following BJKST [68], Beyer et al. [69] use the $k$th smallest hash value to estimate the NDV. Assuming the elements are mapped to the range $(0, 1)$ by the hash function and the $k$th smallest hash value is denoted by $h_{(k)}$, the KMV estimator is given by:
$\hat{D}_{\mathrm{KMV}} = \frac{k - 1}{h_{(k)}}.$
Beyer et al. [69] give the relative error bound of . The authors also use KMV to support the multi-set operations among the partition synopses. We give an example of the KMV sketch below.
Example 4 (KMV). Given a random sequence , we set the hashing function as and . As the sequence elements arrive, we end up maintaining the fourth smallest element, . Then we have .
The KMV estimation accuracy can also be improved by the median trick. Due to KMV’s desirable scalability and small memory space, there is a lot of related and subsequent research work. Cohen [70] first proposes the version of KMV with . Dasgupta et al. [71] provide a generalized version of KMV in the form of the theta-sketch framework.
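A minimal KMV sketch, assuming a hash that maps items into (0, 1]; the MD5-based toy hash and the heap-plus-set bookkeeping are implementation choices of ours.

```python
import hashlib
import heapq

def _h01(x):
    # Toy hash into (0, 1]; any well-mixed hash normalized to the unit interval works.
    return (int(hashlib.md5(str(x).encode()).hexdigest(), 16) + 1) / float(1 << 128)

def kmv_estimate(stream, k=256):
    """KMV: keep the k smallest distinct hash values and return
    D_hat = (k - 1) / h_(k), where h_(k) is the k-th smallest hash value."""
    synopsis = []            # max-heap (negated) holding the k smallest hashes
    members = set()          # the same k hashes, for O(1) duplicate checks
    for x in stream:
        v = _h01(x)
        if v in members:
            continue
        if len(synopsis) < k:
            heapq.heappush(synopsis, -v)
            members.add(v)
        elif v < -synopsis[0]:
            members.remove(-heapq.heapreplace(synopsis, -v))
            members.add(v)
    if len(synopsis) < k:    # fewer than k distinct values: the synopsis is exact
        return float(len(synopsis))
    return (k - 1) / (-synopsis[0])
```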
4.2 Logarithmic hashing
The advantage of the logarithmic hashing algorithm is that it can estimate the number of different elements on a large scale in logarithmic levels of space. Logarithmic hashing algorithms are mainly divided into trailing 1s and leading 0s. The trailing 1s algorithm uses the number of trailing 1s in the hashing bit pattern. The leading 0s algorithm is similar to the trailing 1s algorithm, but instead of maintaining the number of bits, it only records the maximum hashing value.
FM sketch Flajolet and Martin [72] first use hashed bit patterns in a single pass to estimate the number of distinct values. Flajolet and Martin map the elements to binary strings of length $L$, so that a pattern with $i$ trailing zeros followed by a one appears with probability $2^{-(i+1)}$. The FM algorithm uses a bitmap of length $L$, initialized to all zeros, and lets $R$ denote the number of trailing 1s of the bitmap after processing the data. The estimator of the FM algorithm is given as follows:
$\hat{D}_{\mathrm{FM}} = \frac{2^{R}}{\varphi},$
where $\varphi \approx 0.77351$ is a statistical correction factor. In [72], Flajolet and Martin combine multiple estimators with arithmetic averages. This procedure is called Probabilistic Counting with Stochastic Averaging, namely PCSA. Alon et al. [73] give a simple analysis for a similar algorithm, which is introduced next.
AMS sketch Alon et al. [65] first define the frequency moment statistics:
$\mathcal{F}_k = \sum_{j=1}^{D} N_j^{k}.$
The frequency moment of order 0 corresponds to NDV, while the frequency moment of order 1 corresponds to the total size of the data. Alon et al. [65] devise algorithms to compute frequency moments of any integer order not less than 0. Different from the estimation of the higher moments, AMS designs the $\mathcal{F}_0$ algorithm through leading 0s alone. AMS is similar to FM, with the difference being that only the maximum number of leading 0s, $R$, is kept, and the NDV is then estimated using:
$\hat{D}_{\mathrm{AMS}} = 2^{R}.$
AMS keeps track of only the maximum observed value $R$. The authors prove that AMS guarantees a ratio error of at most $c$ with probability at least $1 - 2/c$ for any $c > 2$. With the introduction of leading-0s techniques, the sketches of the subsequent ‘LogLog’ series further reduce the space required to estimate the NDV significantly through different forms of aggregation.
(Super-)LogLog sketch More than a decade later, Durand and Flajolet [74] refine FM into a more practical version, the LogLog algorithm. The LogLog algorithm uses Probabilistic Counting with Stochastic Averaging (PCSA). It utilizes the first $b$ bits of the hash value as the bucket index, mapping the truncated hash values to $m = 2^{b}$ different buckets. In each bucket $j$, LogLog stores the number of leading 0s of the truncated hash, denoted $M_j$. The LogLog algorithm uses the arithmetic mean of the $M_j$ for estimation as follows,
$\hat{D}_{\mathrm{LogLog}} = \alpha_m \, m \, 2^{\frac{1}{m}\sum_{j=1}^{m} M_j},$
where $\alpha_m$ is a bias-correction constant for practical values of $m$, and LogLog guarantees a standard error of about $1.30/\sqrt{m}$. Utilizing the first $b$ bits to divide the hash values improves the utilization of space. The LogLog family of algorithms is named for the fact that each bucket only requires space of about $\log_2 \log_2 D_{\max}$ bits, where $D_{\max}$ is an a priori upper bound on the NDV.
In [74], Durand and Flajolet also provide an improved version of LogLog named SuperLogLog. The SuperLogLog algorithm makes two improvements to LogLog, thus eliminating extremes in the randomization process. The two improvements are summarized as the truncation rule and the restriction rule. The truncation rule retains only the smallest register values and discards the rest; the retained fraction is set to 0.7 in the original paper. The SuperLogLog estimator can be written as:
where the constant is again modified to ensure the estimator is unbiased, and the sum is taken over the truncated registers. With the application of the truncation rule, accuracy indeed increases, with a standard error of about $1.05/\sqrt{m}$. The restriction rule, on the other hand, states that each register can use only a restricted number of bits, that is, 5 bits for a maximum NDV well over $10^9$. SuperLogLog enhances LogLog by eliminating extreme register values. Another way to eliminate the effect of extreme values is to use a different aggregation function, which leads to the HyperLogLog algorithm described next.
HyperLogLog(++) sketch Flajolet et al. [75] propose the HyperLogLog algorithm, which uses the harmonic mean to replace the arithmetic mean of LogLog and makes HyperLogLog a near-optimal NDV estimation algorithm. The HyperLogLog estimator with the normalized bias-corrected harmonic mean is:
$\hat{D}_{\mathrm{HLL}} = \alpha_m \, m^{2} \left(\sum_{j=1}^{m} 2^{-M_j}\right)^{-1},$
where $\alpha_m$ is the bias correction factor, with $\alpha_m = 0.7213/(1 + 1.079/m)$ when $m \ge 128$. HyperLogLog guarantees a standard error of about $1.04/\sqrt{m}$. The authors also make two corrections to deal with values that are beyond the specified range. For the small range, HyperLogLog uses the LC estimator to resolve the nonlinear distortions. The large range is limited by the collisions of the 32-bit hash function, and HyperLogLog uses a correction to the estimator as follows:
$\hat{D}^{*} = -2^{32} \ln\left(1 - \frac{\hat{D}_{\mathrm{HLL}}}{2^{32}}\right).$
The authors argue that HyperLogLog is near optimal because its standard error is close to the best achievable for this class of sketches, but there are still many subsequent studies that try to enhance the effectiveness of HyperLogLog.
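A compact, self-contained sketch of the HyperLogLog procedure described above, with the small-range Linear Counting correction; the SHA-1-based toy 64-bit hash and the parameter choices are assumptions of ours.

```python
import hashlib
import math

def _h64(x):
    # Toy 64-bit hash; MurmurHash or similar would be used in practice.
    return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) & ((1 << 64) - 1)

def hyperloglog(stream, b=10):
    """HyperLogLog sketch: the first b bits pick one of m = 2^b registers,
    each register keeps the maximum rank (leading zeros + 1) seen, and the
    estimate is alpha_m * m^2 / sum_j 2^(-M_j), with the small-range
    (Linear Counting) correction when many registers are still empty."""
    m = 1 << b
    alpha = 0.7213 / (1.0 + 1.079 / m)        # valid for m >= 128
    M = [0] * m
    for x in stream:
        h = _h64(x)
        j = h >> (64 - b)                     # bucket index from the first b bits
        w = h & ((1 << (64 - b)) - 1)         # remaining bits
        rank = (64 - b) - w.bit_length() + 1  # leading zeros + 1
        M[j] = max(M[j], rank)
    e = alpha * m * m / sum(2.0 ** (-r) for r in M)
    zeros = M.count(0)
    if e <= 2.5 * m and zeros > 0:            # small-range correction via LC
        e = m * math.log(m / zeros)
    return e
```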
To handle more practical situations in PowerDrill [76], Heule et al. [77] improve HyperLogLog to HyperLogLog++ for larger NDVs. HyperLogLog++ first replaces the 32-bit hash function with a 64-bit one to cover NDVs beyond the 32-bit range at a low additional cost in memory; using the 64-bit hash function also reduces collisions. Besides, the authors analyze the bias of HyperLogLog with a 64-bit hash function and correct it using k-nearest-neighbor interpolation over empirically measured bias values. The authors also use LC for small NDVs and suggest criteria for when to use which estimator. Further, HyperLogLog++ compresses the sparse representation to reduce memory consumption. Relying on this series of independent improvements, HyperLogLog++ has enhanced efficiency and accuracy and works well in PowerDrill and other databases.
In practice, HyperLogLog++ is one of the most widely used sketches. However, in terms of theory, the optimal algorithm is proposed by Kane et al. [78], which requires $O(\epsilon^{-2} + \log D)$ bits to return a $(1 \pm \epsilon)$-approximation of NDV. Besides, the algorithm can update in $O(1)$ time.
4.3 Summary
Sketch-based methods using uniform hashing are easier to understand and perform better when the NDV is relatively small. In contrast, sketch-based algorithms using logarithmic hashing are better when the NDV is very large. The common advantage of sketch-based methods is their accuracy and efficiency. Moreover, the sketch-based methods are mergeable, which is not the case with the sampling-based methods. As sketch-based NDV estimation has reached a theoretical and practical bottleneck, few superior methods have emerged in this domain. Subsequent research has focused on integrating different sketches or identifying gaps between existing sketch-based algorithms, as seen in works such as [79–81]. With the increasing magnitude of NDV and varying environmental factors, we next explore advancements in both sampling-based methods, which offer better scalability, and sketch-based methods, which are more user-friendly and efficient.
5 Extension of NDV estimation
In this section, we present follow-up work related to NDV estimation in practice. Real environments are more complex and do not fully meet the idealized conditions assumed by many algorithms, so numerous subsequent studies have emerged to address the resulting challenges. First, we present some of the most recent research on NDV estimation in different environments, especially distributed environments. Second, we focus on the scenario faced by sampling-based NDV estimation in realistic databases, i.e., how block sampling can be used for NDV estimation. Finally, we review the development of sketch-based NDV estimation in terms of efficiency and effectiveness.
5.1 Distributed sampling-based NDV estimation
In general, the applicability of an algorithm in a distributed environment depends on whether the results produced by different machines running the algorithm can be effectively merged without incurring significant additional overhead. We should notice that most sketch-based algorithms are mergeable. Thus, mergeable sketch-based algorithms can quickly answer NDV estimation queries in a distributed environment by simply merging the data sketches. While sketch-based methods are well-suited to distributed environments, these environments typically involve large volumes of data, and sketch-based methods require processing the entire dataset. When the dataset is particularly large, performing a full table scan results in substantial I/O costs, which may become prohibitively expensive. Li et al. [83] introduce a novel approach that integrates sampling-based and sketch-based methods to achieve efficient NDV estimation in distributed environments. Their method is designed to maintain low I/O costs while ensuring high accuracy. They address the challenge of calculating the “frequency of frequency” in skewed data distributions by transforming part of the problem into estimating the intersection size of multi-sets. Additionally, they adapt existing estimators to ensure compatibility with mergeable sketches, enabling accurate computation of the required features. This approach is complemented by the findings of Cormode et al. [82], who show that HyperLogLog can be used for multi-set intersection sizes. Instead of transmitting frequencies in distributed environments, Li et al. [83] collect the information with sketches and merge them at the endpoints to estimate the low frequencies of frequency. They leverage the Count Sketch [84] to estimate the second-order frequency moments, using these as features for the estimator. This framework can easily combine different sampling estimators for the distributed environment. Besides, it can be used for the estimation of other statistics as well, simply by combining machine learning and estimating features by sketches.
Wang et al. [85] similarly exploit the fact that HyperLogLog can use a small amount of space to compute intersections and propose a dynamic sketch for NDV estimation that supports insertion and deletion. For another estimation problem, Wang et al. [86] use the high-frequency vector and the Fast-AGMS sketch to answer inner-product queries.
5.2 Block sampling-based NDV estimation
After reviewing the research on NDV estimation in distributed environments, we now turn our attention to the practical challenges of sampling-based NDV estimation, particularly in optimizing it for database environments. Sampling-based NDV estimation is typically employed in very large databases, which often cannot store all data in memory. In such cases, these databases usually compress and store data in blocks or pages, adding another layer of complexity to the estimation process. Unfortunately, according to [18], the cost of sampling the data uniformly at random is already comparable to scanning the entire table. For situations in which uniform sampling over records is not feasible, Brutlag et al. [18] choose block sampling as an alternative. They sample $b$ blocks from the $B$ total blocks. Instead of using the record-level frequencies of frequency, they define block-level frequencies of frequency $F_i^{\mathrm{blk}}$ and $f_i^{\mathrm{blk}}$: $F_i^{\mathrm{blk}}$ denotes the number of distinct values occurring in exactly $i$ blocks in the population, and $f_i^{\mathrm{blk}}$ denotes the number of distinct values occurring in exactly $i$ sampled blocks. With this redefinition for block sampling, they modify the previous estimators for estimation. A typical modified version defines the block sampling estimator as follows:
The experiments of [18] show that the proposed block sampling estimators are competitive with record sampling estimators. For clustered data, the block sampling estimators consistently perform better than the record sampling estimators. In a subsequent work [19], this operation on block sampling is referred to as COLLAPSE. The COLLAPSE strategy counts identical elements within a block only once, regardless of their frequency in that block. As a result, the estimator proposed in [19] for the transformation described in [23] is equivalent to the expression found in [18]. Charikar et al. [23] give a negative result: with a uniform random sample of $n$ tuples from a table of $N$ tuples, no distinct-value estimator can guarantee a small ratio error. Chaudhuri et al. [19] show that with block-level sampling of $n$ tuples, this lower bound can be strengthened further, by a factor depending on the number of tuples per block. Chaudhuri et al. [19] also propose a two-phase adaptive block sampling algorithm, which calculates the sample size required for the desired accuracy and dynamically takes additional samples in the second phase of sampling. Both [23] and [19] show that under block sampling, traditional uniform estimators suffer a relatively high loss of accuracy. Currently, COLLAPSE is a straightforward, somewhat ad hoc heuristic. With the development of machine learning, more methods are available to process sequence data [87], perhaps further improving block sampling estimation. Shekelyan et al. [87] propose the Hidden Shuffle method specifically for sequential data; their empirical results demonstrate that this approach effectively reduces both memory consumption and sampling time.
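As an illustration, the COLLAPSE-style block profile described above can be computed in a few lines of Python (the function name is ours).

```python
from collections import Counter

def block_frequency_of_frequency(blocks):
    """COLLAPSE-style profile for block sampling: each distinct value is
    counted at most once per block, and f_i^blk is the number of distinct
    values that occur in exactly i of the sampled blocks."""
    blocks_per_value = Counter()
    for block in blocks:
        for v in set(block):         # COLLAPSE: within-block duplicates count once
            blocks_per_value[v] += 1
    return Counter(blocks_per_value.values())   # i -> f_i^blk

# Toy usage: two sampled blocks of rows from the same column.
print(block_frequency_of_frequency([["x", "x", "y"], ["y", "z", "z"]]))
# -> Counter({1: 2, 2: 1}): x and z appear in one block each, y in two.
```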
From an engineering point of view, in a distributed environment, the data is pre-processed to transform into a random sample partition (RSP) [88,89]. Both Wei et al. [88], and Salloum et al. [89] demonstrate that the expected sample distribution of RSP data blocks is equivalent to that of the original large data table. However, RSP introduces double the space overhead and does not ensure that newly inserted data remains randomly distributed. The potential for further research into NDV and other statistical estimations using block sampling remains significant.
5.3 Innovative sketch-based NDV estimation
Enhancement of sketches Sketch-based NDV estimation can also leverage optimization techniques to enhance the functionality and performance of sketches. Debnath et al. [90] apply machine learning to improve the accuracy of set-intersection cardinality estimates from Bloom filters, and Guo et al. [91] develop dynamic Bloom filters to better handle fluctuating data. These advancements underscore the effectiveness of optimization techniques in improving sketch-based algorithms. As the focus shifts to more advanced sketch-based algorithms, the discussion naturally leads to HyperLogLog, recognized as one of the most classical and effective NDV estimators. The SQL function Approx_count_distinct, supported in many database services to approximate the number of distinct elements, is implemented using HyperLogLog [77]. Therefore, a series of works have centered around enhancing HyperLogLog, focusing on updating efficiency or improving effectiveness. Ertl [92] notices the deficiencies of the original estimator for small and large NDVs and designs a maximum likelihood estimator that eliminates the deviations using a numerical optimization algorithm. The estimator can be applied in joint NDV estimation to replace inclusion-exclusion methods. The final design in this line of work is UltraLogLog [93], which can reduce the space by 24% while guaranteeing the same accuracy as HyperLogLog. UltraLogLog has been released as part of the open-source Hash4j library. The use of optimization algorithms, including traditional optimization algorithms and machine learning algorithms, to improve sketches has become an important direction for future research.
Integration and expansion of sketches Other sketch-based approaches focus on identifying the relationships between different sketch tasks and leveraging these connections to improve sketch performance. Tsan et al. [94] show concretely how a learning-based approach can enhance classical sketches such as Count-Min and Fast-AGMS, presenting a method for converting a Count-Min sketch into a Fast-AGMS sketch. Borrowing the strengths of other sketch algorithms in this way is a common route to optimizing a sketch. Ertl [81] likewise explores the relationship between MinHash and HyperLogLog. Exploiting such links between sketches allows estimators to combine their respective strengths for better results. Wang et al. [95] introduce a new class of estimators called $\tau$-Generalized-Remaining-Area ($\tau$-GRA) estimators, as well as a dramatically simpler approach to analyzing estimators. They also derive $\tau$-GRA-based estimators for the class of Curtain sketches introduced by Pettie et al. [96], which can be seen as a hybrid of HyperLogLog and PCSA with a more attractive simplicity-accuracy tradeoff than both.
6 Cardinality estimation: the most important application of NDV estimation in practice
NDV estimation serves as a foundational component of accurate cardinality estimation and can itself be viewed as a specific instance of cardinality estimation; conversely, cardinality estimation is the most important application of NDV estimation. Cardinality estimation generally requires two ingredients: the NDV of the join attributes and the selectivity of the predicates. Systems such as PostgreSQL and Spark use NDV estimates when estimating join sizes. As machine learning continues to advance, cardinality estimation has opened up a wealth of new research opportunities. Given the critical role of cardinality estimation, a key application in AI4DB, extensive research has focused on estimating the NDV of joins using techniques such as sampling or sketching [97,98]. Accurate estimation of join size serves as the foundation of cardinality estimation, driving numerous efforts to explore NDV estimation in join operations.
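As a textbook-style illustration (the exact formulas vary across systems), an optimizer typically turns NDV statistics into selectivity and join-size estimates as follows, so any error in the NDV propagates directly into the cardinality estimate:

$$\mathrm{sel}(R.a = c) \approx \frac{1}{\mathrm{NDV}(R.a)}, \qquad |R \bowtie_{R.a = S.b} S| \approx \frac{|R| \cdot |S|}{\max\{\mathrm{NDV}(R.a),\, \mathrm{NDV}(S.b)\}}.$$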
The main learned cardinality estimators can be categorized into two types: query-driven and data-driven [99]. Query-driven estimators [99–102] train on a set of labeled queries to predict the cardinality of a given query. They encode queries as features to mitigate the impact of data dependencies. However, these methods require substantial amounts of training data to be effective, which significantly restricts their applicability. Additionally, query-driven approaches often rely on more complex models to further reduce the influence of data dependencies.
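To make the query-driven framing concrete, the following is a minimal sketch, not any specific system from [99–102]: each query is encoded as a fixed-length vector of normalized predicate bounds, and a regressor is trained on (query, log-cardinality) pairs. The feature scheme, column names, and model choice are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def encode_query(preds, col_ranges):
    """Encode range predicates {col: (lo, hi)} as normalized bounds per column."""
    feats = []
    for col, (cmin, cmax) in col_ranges.items():
        lo, hi = preds.get(col, (cmin, cmax))        # missing predicate = full range
        span = (cmax - cmin) or 1.0
        feats.extend([(lo - cmin) / span, (hi - cmin) / span])
    return np.array(feats)

# hypothetical labeled workload: queries paired with their true cardinalities
col_ranges = {"age": (0, 100), "salary": (0, 200_000)}
queries = [{"age": (30, 40)}, {"salary": (50_000, 80_000)}, {"age": (20, 60)}]
true_cards = [1_200, 3_400, 9_800]

X = np.stack([encode_query(q, col_ranges) for q in queries])
y = np.log1p(true_cards)                             # learn log-cardinality for stability
model = GradientBoostingRegressor().fit(X, y)

# predict the cardinality of an unseen query
est = np.expm1(model.predict(encode_query({"age": (25, 35)}, col_ranges)[None, :]))[0]
```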
Compared to query-driven methods, DeepDB [103] leads the way in the development of data-driven approaches. It introduces the Sum-Product Network (SPN) to hierarchically decompose tables into simpler distributions. The hierarchical splitting ensures that partitions of different columns remain independent, while partitions of the same column can be aggregated via summation. A vertical split corresponds to a Product node in the SPN, where the join cardinality is the product of the individual partition cardinalities due to their independence; conversely, a horizontal split corresponds to a Sum node. To balance space and accuracy, the number of partitions is controlled by a threshold. Another data-driven approach, NeuroCard [104], employs join sampling and modern deep autoregressive models but is generally more time-intensive.
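The following is a minimal, hypothetical sketch of how an SPN-style model evaluates a predicate's probability: leaves hold per-column histograms, Product nodes multiply the probabilities of independent column groups (vertical splits), and Sum nodes take weighted mixtures over row clusters (horizontal splits); multiplying by the table size yields the cardinality estimate. It only illustrates the evaluation semantics and omits DeepDB's learning and splitting procedure.

```python
class Leaf:
    """Per-column histogram: P(range predicate on one column)."""
    def __init__(self, column, bucket_probs, bucket_edges):
        self.column, self.probs, self.edges = column, bucket_probs, bucket_edges

    def prob(self, preds):
        lo, hi = preds.get(self.column, (self.edges[0], self.edges[-1]))
        total = 0.0
        for p, (l, r) in zip(self.probs, zip(self.edges, self.edges[1:])):
            overlap = max(0.0, min(hi, r) - max(lo, l))
            total += p * overlap / (r - l)            # assume uniformity within a bucket
        return total

class Product:
    """Vertical split: columns in different children are treated as independent."""
    def __init__(self, children): self.children = children
    def prob(self, preds):
        result = 1.0
        for c in self.children:
            result *= c.prob(preds)
        return result

class Sum:
    """Horizontal split: weighted mixture over row clusters."""
    def __init__(self, weights, children): self.weights, self.children = weights, children
    def prob(self, preds):
        return sum(w * c.prob(preds) for w, c in zip(self.weights, self.children))

# toy SPN over hypothetical columns "age" and "salary"
spn = Sum([0.6, 0.4], [
    Product([Leaf("age", [0.5, 0.5], [0, 50, 100]),
             Leaf("salary", [0.8, 0.2], [0, 100_000, 200_000])]),
    Product([Leaf("age", [0.2, 0.8], [0, 50, 100]),
             Leaf("salary", [0.3, 0.7], [0, 100_000, 200_000])]),
])
n_rows = 1_000_000
estimate = n_rows * spn.prob({"age": (30, 40)})       # estimated cardinality
```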
Query-driven methods typically result in higher errors than data-driven methods, as they rely heavily on the specific training queries used during their development. However, query-driven methods have smaller inference and training times and smaller model sizes, and they can handle extreme queries such as cyclic joins.
With the development of cardinality estimation, blending the advantages of query-driven and data-driven methods has become the trend [99]. Li et al. [8] propose ALECE, an attention-based learned cardinality estimator for SPJ (Selection, Projection, Join) queries. ALECE exploits attention mechanisms to capture the relationship between queries and dynamic data, making it both data- and query-driven. Moreover, ALECE performs well on dynamic workloads that mix queries with data manipulation statements, including inserts, deletes, and updates. ALECE has been integrated into PostgreSQL and outperforms its built-in optimizer. It is worth noting that the data feature chosen for ALECE is the histogram of elements, which is consistent with the statistics used in NDV estimation. Applying the theory of NDV estimation to cardinality estimation can therefore assist in designing cardinality estimators with theoretical guarantees.
7 Future trends and challenges
This section discusses possible future extensions based on current NDV estimation research.
7.1 Sampling-based property estimation
The problem of support size estimation, i.e., NDV estimation in a statistical context, in conjunction with polynomial theory has attracted notable attention in recent years [58,60,105]. Building on polynomial techniques, the problems of estimating statistics of a discrete distribution from its frequency of frequency have been unified under the framework of property estimation [106]. Given the increasing diversity and widespread application of property estimation, a promising direction for future research is to explore how current NDV estimation techniques, including machine learning approaches, can be integrated into the study of property estimation. Polynomial-based property estimation in the context of databases has already proved to be an interesting problem [45], and more properties remain to be studied.
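As a concrete reference point, the frequency of frequency (the "profile") that these property estimators consume can be computed from a sample in a few lines; the sketch below is a generic illustration, not tied to any specific estimator in [45,58,60,105,106].

```python
from collections import Counter

def frequency_of_frequency(sample):
    """Profile of a sample: F[j] = number of distinct values appearing exactly j times."""
    value_counts = Counter(sample)          # how often each value occurs
    return Counter(value_counts.values())   # how many values share each count

# example: F[1] counts the singletons, the key signal for the unseen-value mass
profile = frequency_of_frequency(["a", "b", "b", "c", "c", "c"])
# profile == {1: 1, 2: 1, 3: 1}
```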
7.2 NDV estimation in complex environments
Although there are mature methods for property estimation using sketch- or sampling-based algorithms, these techniques face significant challenges in more complex environments. First, in distributed settings, reducing the communication cost of computing each frequency of frequency in a sampling-based approach remains difficult. While Li et al. [83] offer a method using the inclusion-exclusion principle to calculate the communication overhead for HyperLogLog intersection results, the communication cost of emerging learning-based HyperLogLog intersection methods still warrants further investigation. Second, when data is obtained through block sampling, leveraging the sequential information within these blocks to improve estimator performance is an open problem. In addition, Shan et al. [107] propose a novel data structure to handle the cardinality estimation of flows for massive high-speed traffic over sliding windows. Finally, in data stream environments, detecting distribution shifts and selecting appropriate estimators or retraining models in response to these shifts are critical areas for further research.
7.3 Non-linear learning-based NDV estimation
While a number of approaches now utilize machine learning to enhance the effectiveness of NDV estimators [45,61,106], they all focus on the linear structure of NDV estimation. In practice, however, nonlinear estimators, such as [45,61,83], achieve superior results. Considering that nonlinear models offer the potential to capture complex patterns and relationships that linear models might overlook, they are a promising avenue for future research.
In conclusion, while progress has been made, the field of NDV estimation still presents numerous challenges and opportunities. Further exploration of nonlinear models, along with addressing issues such as scalability and efficiency, could lead to significant breakthroughs in this area.
8 Conclusion
Estimating the number of distinct values, a fundamental problem in databases with many extended applications, has been studied for decades. In this paper, we conduct a comprehensive survey of the key techniques of NDV estimation, including sampling-based methods, sketch-based methods, and extended settings. We have summarized the contexts in which different NDV estimation methods are applicable and pointed out their shortcomings as well as directions for improvement. NDV estimation and related problems still face challenges at larger data scales. With the development of machine learning, more accurate and efficient NDV estimation can improve downstream tasks in this field. We hope to provide a valuable reference for researchers in this field and to promote the further development of theories and methods related to NDV estimation. Future research will likely focus on estimators that utilize the frequency of frequency as features to estimate properties of discrete data, such as frequency moments and entropy. With the optimization of these basic statistics, databases will be able to handle larger data, benefiting data storage, querying, and analysis.
Fisher R A, Corbet A S, Williams C B. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 1943, 12( 1): 42–58
[2]
Efron B, Thisted R. Estimating the number of unseen species: how many words did Shakespeare know. Biometrika, 1976, 63( 3): 435–447
[3]
Valiant G, Valiant P. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. 2011, 685–694
[4]
Hou W C, Ozsoyoglu G, Taneja B K. Processing aggregate relational queries with hard time constraints. In: Proceedings of 1989 ACM SIGMOD International Conference on Management of Data. 1989, 68–77
[5]
Ozsoyoglu G, Du K, Tjahjana A, Hou W C, Rowland D Y. On estimating count, sum, and average relational algebra queries. In: Proceedings of the International Conference on Database and Expert Systems Applications. 1991, 406–412
[6]
Haas P J, Naughton J F, Seshadri S, Stokes L. Sampling-based estimation of the number of distinct values of an attribute. In: Proceedings of the 21st International Conference on Very Large Data Bases. 1995, 311–322
[7]
Lemire D, Kaser O. Reordering columns for smaller indexes. Information Sciences, 2011, 181( 12): 2550–2570
[8]
Li P, Wei W, Zhu R, Ding B, Zhou J, Lu H. ALECE: an attention-based learned cardinality estimator for SPJ queries on dynamic workloads. Proceedings of the VLDB Endowment, 2023, 17( 2): 197–210
[9]
Chabchoub Y, Chiky R, Dogan B. How can sliding HyperLogLog and EWMA detect port scan attacks in IP traffic. EURASIP Journal on Information Security, 2014, 2014( 1): 5
[10]
Cohen R, Nezri Y. Cardinality estimation in a virtualized network device using online machine learning. IEEE/ACM Transactions on Networking, 2019, 27( 5): 2098–2110
[11]
Clemens V, Schulz L C, Gartner M, Hausheer D. DDoS detection in P4 using HYPERLOGLOG and COUNTMIN sketches. In: Proceedings of NOMS 2023-2023 IEEE/IFIP Network Operations and Management Symposium. 2023, 1–6
[12]
Kalai A T, Vempala S S. Calibrated language models must hallucinate. In: Proceedings of the 56th Annual ACM Symposium on Theory of Computing. 2024, 160–171
[13]
Bunge J, Fitzpatrick M. Estimating the number of species: a review. Journal of the American Statistical Association, 1993, 88( 421): 364–373
[14]
Harmouch H, Naumann F. Cardinality estimation: an experimental survey. Proceedings of the VLDB Endowment, 2017, 11( 4): 499–512
[15]
Batu T, Fortnow L, Rubinfeld R, Smith W D, White P. Testing that distributions are close. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 2000, 259–269
[16]
Paninski L. Estimation of entropy and mutual information. Neural Computation, 2003, 15( 6): 1191–1253
[17]
Orlitsky A, Santhanam N P, Viswanathan K, Zhang J. On modeling profiles instead of values. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. 2004, 426–435
[18]
Brutlag J D, Richardson T S. A block sampling approach to distinct value estimation. Journal of Computational and Graphical Statistics, 2002, 11( 2): 389–404
[19]
Chaudhuri S, Das G, Srivastava U. Effective use of block-level sampling in statistics estimation. In: Proceedings of 2004 ACM SIGMOD International Conference on Management of Data. 2004, 287–298
[20]
Cormode G, Muthukrishnan S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 2005, 55( 1): 58–75
[21]
Bloom B H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970, 13( 7): 422–426
[22]
Li B, Lu Y, Wang C, Kandula S. Q-error bounds of random uniform sampling for cardinality estimation. 2021, arXiv preprint arXiv: 2108.02715
[23]
Charikar M, Chaudhuri S, Motwani R, Narasayya V. Towards estimation error guarantees for distinct values. In: Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2000, 268–279
[24]
Goodman L A. On the estimation of the number of classes in a population. The Annals of Mathematical Statistics, 1949, 20( 4): 572–579
[25]
Good I J, Toulmin G H. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 1956, 43( 1-2): 45–63
[26]
Bromwich T J I A. An Introduction to the Theory of Infinite Series. Providence: American Mathematical Society, 2005
[27]
Shlosser A. On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering Cybernetics, 1981, 19( 1): 97–102
[28]
Haas P J, Stokes L. Estimating the number of classes in a finite population. Journal of the American Statistical Association, 1998, 93( 444): 1475–1487
[29]
Quenouille M H. Problems in plane sampling. The Annals of Mathematical Statistics, 1949, 20( 3): 355–375
[30]
Burnham K P, Overton W S. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika, 1978, 65( 3): 625–633
[31]
Burnham K P, Overton W S. Robust estimation of population size when capture probabilities vary among animals. Ecology, 1979, 60( 5): 927–936
[32]
Heltshe J F, Forrester N E. Estimating species richness using the jackknife procedure. Biometrics, 1983, 39( 1): 1–11
[33]
Smith E P, van Belle G. Nonparametric estimation of species richness. Biometrics, 1984, 40( 1): 119–129
[34]
Efron B. Bootstrap methods: another look at the jackknife. In: Kotz S, Johnson N L, eds. Breakthroughs in Statistics: Methodology and Distribution. New York: Springer, 1992, 569–593
[35]
Horvitz D G, Thompson D J. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 1952, 47( 260): 663–685
[36]
Särndal C E, Swensson B, Wretman J. Model Assisted Survey Sampling. New York: Springer, 2003
[37]
Chao A. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 1984, 11( 4): 265–270
[38]
Chao A, Shen T. User’s guide for program SPADE (species prediction and diversity estimation). National Tsing Hua University, Dissertation, 2010
[39]
Chao A, Lee S M. Estimating the number of classes via sample coverage. Journal of the American statistical Association, 1992, 87( 417): 210–217
[40]
Good I J. The population frequencies of species and the estimation of population parameters. Biometrika, 1953, 40( 3-4): 237–264
[41]
Deolalikar V, Laffitte H. Extensive large-scale study of error in samping-based distinct value estimators for databases. 2016, arXiv preprint arXiv: 1612.00476
[42]
Motwani R, Vassilvitskii S. Distinct values estimators for power law distributions. In: Proceedings of the 3rd Workshop on Analytic Algorithmics and Combinatorics. 2006, 230–237
[43]
Korwar R M. On the observed number of classes from multivariate power series and hypergeometric distributions. Sankhyā: The Indian Journal of Statistics, Series B, 1988, 50( 1): 39–59
[44]
Valiant P, Valiant G. Estimating the unseen: improved estimators for entropy and other properties. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 2157–2165
[45]
Li J, Lei R, Wang S, Wei Z, Ding B. Learning-based property estimation with polynomials. Proceedings of the ACM on Management of Data, 2024, 2( 3): 1–27
[46]
Valiant G, Valiant P. Instance optimal learning of discrete distributions. In: Proceedings of the 48th Annual ACM Symposium on Theory of Computing. 2016, 142–155
[47]
Raghunathan A, Valiant G, Zou J. Estimating the unseen from multiple populations. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 2855–2863
[48]
Valiant G, Valiant P. Estimating the unseen: improved estimators for entropy and other properties. Journal of the ACM (JACM), 2017, 64( 6): 37
[49]
Valiant G, Valiant P. Estimating the unseen: a sublinear-sample canonical estimator of distributions. Electronic Colloquium on Computational Complexity (ECCC), 2010, TR10-180
[50]
Acharya J, Das H, Orlitsky A, Suresh A T. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 11–21
[51]
Pavlichin D S, Jiao J, Weissman T. Approximate profile maximum likelihood. Journal of Machine Learning Research, 2019, 20( 122): 1–55
[52]
Acharya J, Das H, Mohimani H, Orlitsky A, Pan S. Exact calculation of pattern probabilities. In: Proceedings of 2010 IEEE International Symposium on Information Theory. 2010, 1498–1502
[53]
Orlitsky A, Sajama S, Santhanam N P, Viswanathan K, Zhang J. Algorithms for modeling distributions over large alphabets. In: Proceedings of International Symposium on Information Theory. 2004, 304
[54]
Orlitsky A, Santhanam N, Viswanathan K, Zhang J. Theoretical and experimental results on modeling low probabilities. In: Proceedings of 2006 IEEE Information Theory Workshop – ITW ’06 Punta del Este. 2006, 242–246
[55]
Vontobel P O. The Bethe approximation of the pattern maximum likelihood distribution. In: Proceedings of 2012 IEEE International Symposium on Information Theory Proceedings. 2012, 2012–2016
[56]
Vontobel P O. The Bethe and Sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the Valiant-Valiant estimate. In: Proceedings of 2014 Information Theory and Applications Workshop (ITA). 2014, 1–10
[57]
Wu Y, Yang P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 2016, 62( 6): 3702–3720
[58]
Wu Y, Yang P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 2019, 47( 2): 857–883
[59]
Chien I. Regularized weighted Chebyshev approximations for support estimation. University of Illinois at Urbana-Champaign, Dissertation, 2019
[60]
Eden T, Indyk P, Narayanan S, Rubinfeld R, Silwal S, Wagner T. Learning-based support estimation in sublinear time. In: Proceedings of the 9th International Conference on Learning Representations. 2021
[61]
Wu R, Ding B, Chu X, Wei Z, Dai X, Guan T, Zhou J. Learning to be a statistician: learned estimator for number of distinct values. Proceedings of the VLDB Endowment, 2021, 15( 2): 272–284
[62]
Whang K Y, Vander-Zanden B T, Taylor H M. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems (TODS), 1990, 15( 2): 208–229
[63]
Swamidass S J, Baldi P. Mathematical correction for fingerprint similarity measures to improve chemical retrieval. Journal of Chemical Information and Modeling, 2007, 47( 3): 952–964
[64]
Papapetrou O, Siberski W, Nejdl W. Cardinality estimation and dynamic length adaptation for bloom filters. Distributed and Parallel Databases, 2010, 28( 2): 119–156
[65]
Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. In: Proceedings of the 28th Annual ACM Symposium on Theory of Computing. 1996, 20–29
[66]
Cormode G. Count-min sketch. In: Liu L, Özsu M T, eds. Encyclopedia of Database Systems. New York: Springer, 2009, 511–516
[67]
Gibbons P B, Tirthapura S. Estimating simple functions on the union of data streams. In: Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures. 2001, 281–291
[68]
Bar-Yossef Z, Jayram T S, Kumar R, Sivakumar D, Trevisan L. Counting distinct elements in a data stream. In: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science. 2002, 1–10
[69]
Beyer K, Gemulla R, Haas P J, Reinwald B, Sismanis Y. Distinct-value synopses for multiset operations. Communications of the ACM, 2009, 52( 10): 87–95
[70]
Cohen E. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 1997, 55( 3): 441–453
[71]
Dasgupta A, Lang K J, Rhodes L, Thaler J. A framework for estimating stream expression cardinalities. In: Proceedings of the 19th International Conference on Database Theory. 2016
[72]
Flajolet P, Martin G N. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 1985, 31( 2): 182–209
[73]
Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 1999, 58( 1): 137–147
[74]
Durand M, Flajolet P. Loglog counting of large cardinalities. In: Proceedings of the 11th Annual European Symposium on Algorithms. 2003, 605–617
[75]
Flajolet P, Fusy É, Gandouet O, Meunier F. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Proceedings of 2007 Conference on Analysis of Algorithms. 2007, 127–146
[76]
Hall A, Bachmann O, Büssow R, Gănceanu S, Nunkesser M. Processing a trillion cells per mouse click. Proceedings of the VLDB Endowment, 2012, 5( 11): 1436–1446
[77]
Heule S, Nunkesser M, Hall A. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology. 2013, 683–692
[78]
Kane D M, Nelson J, Woodruff D P. An optimal algorithm for the distinct elements problem. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2010, 41–52
[79]
Ting D. Approximate distinct counts for billions of datasets. In: Proceedings of the 2019 International Conference on Management of Data. 2019, 69–86
[80]
Chiosa M, Preußer T B, Alonso G. SKT: a one-pass multi-sketch data analytics accelerator. Proceedings of the VLDB Endowment, 2021, 14( 11): 2369–2382
[81]
Ertl O. SetSketch: filling the gap between MinHash and HyperLogLog. Proceedings of the VLDB Endowment, 2021, 14( 11): 2244–2257
[82]
Cormode G, Yi K. Small Summaries for Big Data. Cambridge: Cambridge University Press, 2020
[83]
Li J, Wei Z, Ding B, Dai X, Lu L, Zhou J. Sampling-based estimation of the number of distinct values in distributed environment. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022, 893–903
[84]
Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. In: Proceedings of the 29th International Colloquium on International Colloquium on Automata, Languages, and Programming. 2002, 693–703
[85]
Wang P, Xie D, Zhao J, Li J, Li Z, Li R, Ren Y, Di J. Half-Xor: a fully-dynamic sketch for estimating the number of distinct values in big tables. IEEE Transactions on Knowledge and Data Engineering, 2024, 36( 7): 3111–3125
[86]
Wang F, Chen Q, Li Y, Yang T, Tu Y, Yu L, Cui B. JoinSketch: a sketch algorithm for accurate and unbiased inner-product estimation. Proceedings of the ACM on Management of Data, 2023, 1( 1): 81
[87]
Shekelyan M, Cormode G. Sequential random sampling revisited: hidden shuffle method. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics. 2021, 3628–3636
[88]
Wei C, Salloum S, Emara T Z, Zhang X, Huang J Z, He Y. A two-stage data processing algorithm to generate random sample partitions for big data analysis. In: Proceedings of the 11th International Conference on Cloud Computing. 2018, 347–364
[89]
Salloum S, Huang J Z, He Y. Random sample partition: a distributed data model for big data analysis. IEEE Transactions on Industrial Informatics, 2019, 15( 11): 5846–5854
[90]
Debnath S K, Dutta R. Secure and efficient private set intersection cardinality using bloom filter. In: Proceedings of the 18th International Conference on Information Security. 2015, 209–226
[91]
Guo D, Wu J, Chen H, Yuan Y, Luo X. The dynamic bloom filters. IEEE Transactions on Knowledge and Data Engineering, 2010, 22( 1): 120–133
[92]
Ertl O. New cardinality estimation algorithms for HyperLogLog sketches. 2017, arXiv preprint arXiv: 1702.01284
[93]
Ertl O. UltraLogLog: a practical and more space-efficient alternative to HyperLogLog for approximate distinct counting. Proceedings of the VLDB Endowment, 2024, 17( 7): 1655–1668
[94]
Tsan B, Datta A, Izenov Y, Rusu F. Approximate sketches. Proceedings of the ACM on Management of Data, 2024, 2( 1): 66
[95]
Wang D, Pettie S. Better cardinality estimators for HyperLogLog, PCSA, and beyond. In: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 2023, 317–327
[96]
Pettie S, Wang D, Yin L. Non-mergeable sketching for cardinality estimation. In: Proceedings of the 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). 2021
[97]
Qiu Y, Wang Y, Yi K, Li F, Wu B, Zhan C. Weighted distinct sampling: cardinality estimation for SPJ queries. In: Proceedings of 2021 International Conference on Management of Data. 2021, 1465–1477
[98]
Dai B, Hu X, Yi K. Reservoir sampling over joins. Proceedings of the ACM on Management of Data, 2024, 2( 3): 1–26
[99]
Kim K, Jung J, Seo I, Han W S, Choi K, Chong J. Learned cardinality estimation: an in-depth study. In: Proceedings of 2022 International Conference on Management of Data. 2022, 1214–1227
[100]
Kipf A, Kipf T, Radke B, Leis V, Boncz P A, Kemper A. Learned cardinalities: estimating correlated joins with deep learning. In: Proceedings of the 9th Biennial Conference on Innovative Data Systems Research. 2019
[101]
Sun J, Li G. An end-to-end learning-based cost estimator. Proceedings of the VLDB Endowment, 2019, 13( 3): 307–319
[102]
Negi P, Wu Z, Kipf A, Tatbul N, Marcus R, Madden S, Kraska T, Alizadeh M. Robust query driven cardinality estimation under changing workloads. Proceedings of the VLDB Endowment, 2023, 16( 6): 1520–1533
[103]
Hilprecht B, Schmidt A, Kulessa M, Molina A, Kersting K, Binnig C. DeepDB: learn from data, not from queries! Proceedings of the VLDB Endowment, 2020, 13(7): 992–1005
[104]
Yang Z, Kamsetty A, Luan S, Liang E, Duan Y, Chen X, Stoica I. NeuroCard: one cardinality estimator for all tables. Proceedings of the VLDB Endowment, 2020, 14( 1): 61–73
[105]
Chien E, Milenkovic O, Nedich A. Support estimation with sampling artifacts and errors. In: Proceedings of 2021 IEEE International Symposium on Information Theory (ISIT). 2021, 244–249
[106]
Hao Y, Orlitsky A. Unified sample-optimal property estimation in near-linear time. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 996
[107]
Shan J, Fu Y, Ni G, Luo J, Wu Z. Fast counting the cardinality of flows for big traffic over sliding windows. Frontiers of Computer Science, 2017, 11( 1): 119–129