Efficient histogram-based range query estimation for dirty data

Yan ZHANG; Hongzhi WANG; Long YANG; Jianzhong LI

doi:10.1007/s11704-016-5551-1

Front. Comput. Sci. ›› 2018, Vol. 12 ›› Issue (5) :984 -999. DOI: 10.1007/s11704-016-5551-1

RESEARCH ARTICLE

Efficient histogram-based range query estimation for dirty data

Author information +

History +

PDF (564KB)

Abstract

In recent years, data quality issues have attracted wide attentions. Data quality problems are mainly caused by dirty data. Currently, many methods for dirty data management have been proposed, and one of them is entity-based relational database in which one tuple represents an entity. The traditional query optimizations are not suitable for the new entity-based model. Then new query optimizations need to be developed. In this paper, we propose a new query selectivity estimation strategy based on histogram, and focus on solving the overestimation which traditional methods lead to. We prove our approaches are unbiased. The experimental results on both real and synthetic data sets show that our approaches can give good estimates with low error.

Keywords

query estimation / data quality / histogram / dirty data management

Cite this article

Download citation ▾

Yan ZHANG, Hongzhi WANG, Long YANG, Jianzhong LI. Efficient histogram-based range query estimation for dirty data. Front. Comput. Sci., 2018, 12 (5) : 984-999 DOI:10.1007/s11704-016-5551-1

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Batini C, Scannapieco M. Data Quality: Concepts, Methodologies and Techniques. New York: Springer Publishing Company, Inc., 2006

[2]	Lenzerini M. Data integration: a theoretical perspective. In: Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2015, 233–246

[3]	Dong X L, Halevy A, Yu C. Data integration with uncertainty. The VLDB Journal—The International Journal on Very Large Data Bases, 2009, 18(2): 469–500

[4]	Redman T. The impact of poor data quality on the typical enterprise. Communications of the ACM, 1998, 41(2): 49–71

[5]	Raman D, Ton Z. Execution: the missing link in retail operations. Jutas Bus.l, 2001, 43(3): 489–503

[6]	English L P. Information quality management: the next frontier. In: Proceedings of ASQ World Conference on Quality and Improvement. 2001

[7]	Rahm E, Do H H. Data cleaning: problems and current approaches. IEEE Data Engineering Bulletin, 2000, 23(23): 3–13

[8]	Fan W F, Li J, Ma S, Tang N, Yu W. Interaction between record matching and data repairing. Journal of Data & Information Quality, 2011, 4(4): 469–480

[9]	Fuxman A D, Miller R J. First-order query rewriting for inconsistent databases. In: Proceedings of International Conference on Database Theory. 2005, 337–351

[10]	Andritsos P, Fuxman A, Miller R J. Clean answers over dirty databases: a probabilistic approach. IEEE Computer Society, 2006, 30

[11]	Wolf G, Kalavagattu A, Khatri H, Balakrishnan R, Chokshi B, Fan J, Chen Y, Kambhampati S. Query processing over incomplete autonomous databases: query rewriting using learned data dependencies. The VLDB Journal, 2009, 18(5): 1167–1190

[12]	Fuxman A, Fazli E, Miller J. Conquer: efficient management of inconsistent databases. In: Proceedings of SIGMOD Conference. 2005, 155–166

[13]	Boulos J, Dalvi N, Mandhani B, Mathur S, Re C, Suciu D. MYSTIQ: a system for finding more answers by using probabilities. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2005, 891–893

[14]	Dalvi N, Suciu D. Management of probabilistic data: foundations and challenges. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2007, 1–12

[15]	Widom J. Trio: a system for integrated management of data, accuracy, and lineage. In: Proceedings of the Conference on Innovative Data Systems Research (CIDR). 2005, 262–276

[16]	Hassanzadeh O, Miller R J. Creating probabilistic databases from duplicated data. The VLDB Journal—The International Journal on Very Large Data Bases, 2009, 18(5): 1141–1166

[17]	Benjelloun O, Garcia-Molina H, Menestrina D, Whang S E, Su Q, Widom J. Swoosh: a generic approach to entity resolution. The VLDB Journal—The International Journal on Very Large Data Bases, 2009, 18(1): 255–276

[18]	Whang S E, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H. Entity resolution with iterative blocking. In: Proceedings of the 35th SIGMOD International Conference on Management of Data. 2009, 219–232

[19]	Li Y, Wang H, Gao H. Efficient entity resolution based on sequence rules. In: Proceedings of Communications in Computer and Information Science. 2011, 381–388

[20]	Lu W, Fung G P C, Du X, Zhou X, Chen L, Deng K. Approximate entity extraction in temporal databases. World Wide Web, 2011, 14(2): 157–186

[21]	Zhang W J, Zhan L M, Zhang Y, Cheema M A, Lin X M. Efficient top-k similarity join processing over multi-valued objects. World Wide Web, 2014, 17(3): 285–309

[22]	Ioannidis Y E. The history of histograms (abridged). In: Proceedings of the 29th International Conference on Very Large Data Bases. 2004, 19–30

[23]	Cormode G, Garofalakis M. Histograms and wavelets on probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(8): 1142–1157

[24]	Cormode G, Deligiannakis A, Garofalakis M, McGregor A. Probabilistic histograms for probabilistic data. Proceedings of the VLDB Endowment, 2009, 2(1): 526–537

[25]	Wang H Z, Liu X L, Li J Z, Tong X, Yang L, Li Y K. EntityManager: an entity-based dirty data management system. In: Proceedings of International Conference on Database Systems for Advanced Applications. 2013, 468–471

[26]	Abiteboul S, Kanellakis P, Grahne G. On the representation and querying of sets of possible worlds. Theoretical Computer Science, 1987, 16(3): 34–48

[27]	Fuhr N, Rolleke T. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 1997, 15(1): 32–66

[28]	Lakshmanan L, Leone N, Ross R, Subrahmanian V S. Probview: a flexible probabilistic database system. ACM Transactions on Database Systems, 1997, 22(3): 419–469

[29]	Nierman A, Jagadish H. ProTDB: probabilistic data in XML. In: Proceedings of the 28th International Conference on Very Large Data Bases. 2002, 646–657

[30]	Jin C Q, Yi K, Chen L, Yu J X, Lin X. Sliding-window top-k queries on uncertain streams. Proceedings of the VLDB Endowment, 2008, 1(1): 301–312

[31]	Burdick D, Deshpande P M, Jayram T S, Ramakrishnan R, Vaithyanathan S. OLAP over uncertain and imprecise data. The VLDB Journal—The International Journal on Very Large Data Bases, 2007, 16(1): 123–144

[32]	Qi Y, Jain R, Singh S, Prabhakar S. Threshold query optimization for uncertain data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 315–326

[33]	Tao Y F, Cheng R, Xiao X K, Ngai W K, Kao B, Prabhakar S. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of the 31st International Conference on Very Large Data Bases. 2005, 922–933

[34]	Tao Y F, Xiao X K, Cheng R. Range search on multidimensional uncertain data. ACM Transactions on Database Systems, 2007, 32(3): 15

[35]	Dalvi N, Suciu D. Efficient query evaluation on probabilistic databases. In: Proceedings of International Conference on Very Large Databases. 2008, 16(1): 119–128

[36]	Cheng R, Kalashnikov D V, Prabhakar S. Evaluating probabilistic queries over imprecise data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2003, 551–562

[37]	Pei J, Jiang B, Lin X M, Yuan Y D. Probabilistic skylines on uncertain data. In: Proceedings of the 33rd International Conference on Very Large Data Bases. 2007, 15–26

[38]	Dellis E, Seeger B. Efficient computation of reverse skyline queries. In: Proceedings of the 33rd International Conference on Very Large Data Bases. 2007, 291–302

[39]	Soliman M A, Ilyas I F, Chang K C C. Top-k query processing in uncertain databases. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 896–905

[40]	Ge T, Zdonik S, Madden S. Top-k queries on uncertain data: on score distribution and typical answers. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 375–388

[41]	Wang G R, Huo H, Han D H, Hui X Y. Query processing and optimization techniques over streamed fragmented XML. World Wide Web, 2008, 11(3): 339–359

[42]	Barbosa D, Mignet L, Veltri P. Studying the XML Web: gathering statistics from an XML sample. World Wide Web, 2006, 9(2): 187–212

[43]	Kooi R. The optimization of queries in relational databases. Dissertation for the Doctoral Degree. Cleveland, Ohio: Case Western Reserve University, 1980

[44]	Piatetsky-Shapiro G, Connell C. Accurate estimation of the number of tuples satisfying a condition. ACM SIGMOD Record, 1984, 14(2): 256–276

[45]	Ioannidis Y, Poosala V. Balancing histogram optimality and practicality for query result size estimation. ACM SIGMOD Record, 1995, 24(2): 233–244

[46]	Gunopulos D, Kollios G, Tsotras V J, Domeniconi C. Approximating multi-dimensional aggregate range queries over real attributes. ACM SIGMOD Record, 2000, 29(2): 463–474.

[47]	Bruno N, Chaudhuri S, Gravano L. STHoles: a multidimensional workload aware histogram. ACM SIGMOD Record, 2001, 30(2): 211–222

[48]	Haas P J, Naughton J F, Seshadri S, Swami A N. Selectivity and cost estimation for joins based on random sampling. Journal of Computer and System Sciences, 1996, 52(3): 550–569

[49]	Lipton R J, Naughton J F. Query size estimation by adaptive sampling. Journal of Computer and System Sciences, 1995, 51(1): 18–25

[50]	Olken F. Random sampling from databases. Dissertation for the Doctoral Degree. University of California at Berkeley, 1997

[51]	Ngu A, Harangsri B, Shepherd J. Query size estimation for joins using systematic sampling. Distributed and Parallel Databases, 2004, 15(3): 237–275

[52]	Chaudhuri S, Das G, Narasayya V R. Optimized stratified sampling for approximate query processing. ACM Transactions on Database Systems, 2007, 32(2): 9

[53]	Zhang Y, Yang L, Wang H Z. Range query estimation for dirty data management system. In: Proceedings of International Conference on Web-Age Information Management. 2012, 152–164