String similarity join with different similarity thresholds based on novel indexing techniques

Chuitian RONG; Yasin N. SILVA; Chunqing LI

doi:10.1007/s11704-016-5231-1

Front. Comput. Sci. ›› 2017, Vol. 11 ›› Issue (2) :307 -319. DOI: 10.1007/s11704-016-5231-1

RESEARCH ARTICLE

String similarity join with different similarity thresholds based on novel indexing techniques

Author information +

History +

PDF (741KB)

Abstract

String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a unique threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate much more typos than shorter ones. Therefore, we proposed a solution for string similarity joins that supports different similarity thresholds in a single operator. In order to support different thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we proposed new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.

Keywords

Cite this article

Download citation ▾

Chuitian RONG, Yasin N. SILVA, Chunqing LI. String similarity join with different similarity thresholds based on novel indexing techniques. Front. Comput. Sci., 2017, 11 (2) : 307-319 DOI:10.1007/s11704-016-5231-1

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Monge A, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1996, 267–270

[2]	Zhang Z J, Hadjieleftheriou M, Ooi B, Srivastava D. Bed-tree: an allpurpose index structure for string similarity search based on edit distance. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 915–926

[3]	Lu W, Du X Y, Hadjieleftheriou M, Ooi B C. Efficiently supporting edit distance based string similarity search using b+-trees. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(12): 2983–2996

[4]	Wang J N, Feng J H, Li G L. Trie-join: efficient trie-based string similarity joins with edit-distance constraints. Proceedings of the VLDB Endowment, 2010, 3(1–2): 1219–1230

[5]	Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2004, 743–754

[6]	Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006, 61–72

[7]	Bayardo R, Ma Y, Srikant R. Scaling up all pairs similarity search. In: Proceedings of the 16th ACM International Conference onWorldWide Web. 2007, 131–140

[8]	Xiao C, Wang W, Lin X M, Yu J. Efficient similarity joins for near duplicate detection. In: Proceedings of ACM International Conference on World Wide Web. 2008, 563–574

[9]	Hernández M, Stolfo S. The merge/purge problem for large databases. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 1995, 127–138

[10]	Winkler W E. The state of record linkage and current research problems. Technical Report, Statistical Research Division, U.S. Census Bureau. 1999

[11]	Sivic J, Zisserman A. Video google: a text retrieval approach to object matching in videos. 2003, 1470–1477

[12]	Dong X, Halevy A, Madhavan J. Reference reconciliation in complex information spaces. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2005, 85–96

[13]	Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278

[14]	Arasu A, Ré C, Suciu D. Large-scale deduplication with constraints using dedupalog. In: Proceedings of the 25th IEEE International Conference on Data Engineering. 2009, 952–963

[15]	Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In: Proceedings of the VLDB Endowment. 2001, 491–500

[16]	Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16

[17]	Naumann F, Herschel M. An Introduction to duplicate detection. Synthesis Lectures on Data Management, 2010, 2(1): 1–87

[18]	Jiang Y, Li G L, Feng J H, Li W S. String similarity joins: an experimental evaluation. Proceedings of the VLDB Endowment, 2014, 7(8): 625–636

[19]	Chaudhuri S, Kaushik R. Extending autocompletion to tolerate errors. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 707–718

[20]	Deng D, Li G L, Feng J H. A pivotal prefix based filtering algorithm for string similarity search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2014, 673–684

[21]	Wang J N, Li G L, Feng J H. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2012, 85–96

[22]	Rong C T, Lu W, Wang X L, Du X Y, Chen Y G, Tung A K H. Efficient and scalable processing of string similarity join. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(10): 2217–2230

[23]	Lu J H, Lin C B, Wang W, Li C, Wang H Y. String similarity measures and joins with synonyms. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2013, 373–384

[24]	Li G L, He J, Deng D, Li J. Efficient similarity join and search on multi-attribute data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2015, 1137–1151

[25]	Salton G, McGill M J. Introduction to Modern Information Retrieval. New York: McGraw-Hill, Inc., 1986

[26]	Witten I H, Moffat A, Bell T C. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. San Francisco, CA: Morgan Kaufmann, 1999

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag Berlin Heidelberg

PDF (741KB)

1468

Accesses

Citation

Detail

Sections

Recommended

About the journal

Aims & scope

Description

Editorial board

Abstracting / indexing

Contact us

Browse

Just accepted

All volumes and issues

Collections

Featured articles

Most accessed

Most cited

Collections

Multimedia collections

Authors & reviewers

Online submission

Call for papers

Guidelines for authors

Download templates

Guidelines for reviewers

Abstract

Keywords

Cite this article

References

RIGHTS & PERMISSIONS