String similarity search and join: a survey
Minghe YU, Guoliang LI, Dong DENG, Jianhua FENG
String similarity search and join: a survey
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
string similarity / similarity search / similarity join / top-k
[1] |
Zhang C J, Chen L, Tong Y, Liu Z. Cleaning uncertain data with a noisy crowd. In: Proceedings of the 31st IEEE International Conference on Data Engineering. 2015, 6–17
|
[2] |
Papotti P, Naumann F, Kruse S. Estimating data integration and cleaning effort. In: Proceedings of International Conference on Extending Database Technology. 2015, 61–72
|
[3] |
Chu X, Morcos J, Ilyas I F, Ouzzani M, Papotti P, Tang N, Ye Y. KATARA: a data cleaning system powered by knowledge bases and crowdsourcing. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, 1247–1261
|
[4] |
Verma P, Kesswani N. Web usage mining framework for data cleaning and IP address identification. 2014, arXiv: 1408.5460v1
|
[5] |
Maccio V J, Chiang F, Down D G. Models for distributed, large scale data cleaning. In: Proceedings of Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2014, 369–380
|
[6] |
Almeida R, Oliveira P, Braga L, Barroso J. Ontologies for reusing data cleaning knowledge. In: Proceedings of International Catholic Stewardship Council. 2012, 238–241
|
[7] |
Fan J, Li G, Zhou L, Chen S, Hu J. SEAL: spatio-textual similarity search. The Proceedings of the VLDB Endowment, 2012, 5(9): 824–835
|
[8] |
Yu M, Li G, Wang T, Feng J, Gong Z. Efficient filtering algorithms for location-aware publish/subscribe. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4): 950–963
|
[9] |
Li G, Ooi B C, Feng J, Wang J, Zhou L. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2008, 903–914
|
[10] |
Badgeley M A, Sealfon S C, Chikina M D. Hybrid Bayesian-rank integration approach improves the predictive power of genomic dataset aggregation. Bioinformatics, 2015, 31(2): 209–215
|
[11] |
Lui T, Tsui N, Chan L W, Wong C, Siu P, Yung B Y M. DECODE: an integrated differential co-expression and differential expression analysis of gene expression data. BMC Bioinformatics, 2015, 16: 182
|
[12] |
Arfaoui N, Akaichi J. Automating schema integration technique case study: generating data warehouse schema from data mart schemas. Communications in Computer and Information Science, 2015, 521: 200–209
|
[13] |
Nastase V, Fahrni A. Coarse-grained cross-lingual alignment of comparable texts with topic models and encyclopedic knowledge. 2014, arXiv: 1411.7820v1
|
[14] |
Srikantaiah K C, Suraj M, Venugopal K R, Patnaik L.M. Similarity based dynamic web data extraction and integration system from search engine result pages for web content mining. ACEEE International Journal on Information Technology, 2013, 3(1): 42–49
|
[15] |
Cevahir A. Scalable textual similarity search on large document collections through random indexing and K-means clustering. In: Proceedings of Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2014, 231–238
|
[16] |
Yin J, Wang J. A dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014,233–242
|
[17] |
Dai Z, Sun A, Liu X. Crest: cluster-based representation enrichment for short text classification. In: Proceedings of Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2013, 256–267
|
[18] |
SureshReddy G, Rajinikanth T V, Rao A A. Design and analysis of novel similarity measure for clustering and classification of high dimensional text documents. In: Proceedings of the 15th International Conference on Computer Systems and Technologies. 2014, 194–201
|
[19] |
Liu S, Li G, Feng J. A prefix-filter based method for spatio-textual similarity join. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(10): 2354–2367
|
[20] |
Wang J, Li G, Kraska T, Franklin M J, Feng J. Leveraging transitive relations for crowdsourced joins. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2013, 229–240
|
[21] |
Wang J, Li G, Yu J X, Feng J. Entity matching: how similar is similar. The Proceedings of the VLDB Endowment, 2011, 4(10): 622–633
|
[22] |
Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of ACM SIGMOD international conference on Management of data. 2003, 313–324
|
[23] |
Wang J, Li G, Feng J. Fast-join: an efficient method for fuzzy token matching based string similarity join. In: Proceedings of the 27th IEEE International Conference on Data Engineering. 2011, 458–469
|
[24] |
Wang J, Li G, Feng J. Extending string similarity join to tolerant fuzzy token matching. ACM Transactions on Database Systems, 2014, 39(1):7
|
[25] |
Nandi A, Jagadish H V. Effective phrase prediction. In: Proceedings of the 33rd International Conference on Very Large Databases, 2007, 219–230
|
[26] |
Ji S, Li G, Li C, Feng J. Efficient interactive fuzzy keyword search. In: Proceedings of the 18th International Conference onWorld Wide Web. 2009, 371–380
|
[27] |
Chaudhuri S, Kaushik R. Extending autocompletion to tolerate errors. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 707–718
|
[28] |
Zheng Y, Bao Z, Shou L, Tung A K. MESA: a map service to support fuzzy type-ahead search over geo-textual data. Proceedings of the VLDB Endowment, 2014, 7(13): 1545–1548
|
[29] |
Li G, Ji S, Li C, Feng J. Efficient type-ahead search on relational data: a TASTIER approach. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 695–706
|
[30] |
Kavila S D, Ravva R, Bandaru R. Fuzzy type-ahead keyword search in RDF data. In: Proceedings of International Conference on Frontiers of Intelligent Computing: Theory and Applications. 2013, 67–73
|
[31] |
Chandel A, Nagesh P C, Sarawagi S. Efficient batch top-k search for dictionary-based entity recognition. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006, 28
|
[32] |
Cowan B, Zethelius S, Luk B, Baras T, Ukarde P, Zhang D. Named entity recognition in travel-related search queries. In: Proceedings of Association for the Advancement of Artificial Intelligence Conference, 2015, 3935–3941
|
[33] |
Tang Z, Jiang L, Yang L, Li K, Li K. CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework. Cluster Computing, 2015, 18(2): 493–505
|
[34] |
Lu W, Fung G P C, Du X, Zhou X, Chen L, Deng K. Approximate entity extraction in temporal databases. World Wide Web, 2011, 14(2): 157–186
|
[35] |
Wang W, Xiao C, Lin X, Zhang C. Efficient approximate entity extraction with edit distance constraints. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 759–770
|
[36] |
Deng D, Li G, Feng J. An efficient trie-based method for approximate entity extraction with edit-distance constraints. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 762–773
|
[37] |
Nakajima D, Mitsui Y, Samejima M, Akiyoshi M. An information extraction method from different structural web sites by word distances between a user instantiated label and similar entity. In: Proceedings of International Conference on Data Mining Workshops, 2011, 1177–1182
|
[38] |
Deng D, Li G, Feng J, Duan Y, Gong Z. A unified framework for approximate dictionary-based entity extraction. The International Journal on Very Large Data Bases, 2015, 24(1): 143–167
|
[39] |
Kim Y, Shim K. Efficient top-k algorithms for approximate substring matching. In: Proceedings of ACMSIGMOD International Conference on Management of Data. 2013, 385–396
|
[40] |
Tang N, Sidirourgos L, Boncz P A. Space-economical partial gram indices for exact substring matching. In: Proceedings of the 18th ACM International Conference on Information and Knowledge Management. 2009, 285–294
|
[41] |
Ge T, Li Z. Approximate substring matching over uncertain strings. Proceedings of the VLDB Endowment, 2011, 4(11): 772–782
|
[42] |
Warren R H, Tompa F W. Multi-column substring matching for database schema translation. In: Proceedings of the 32nd International Conference on Very Large Databases. 2006, 331–342
|
[43] |
Jokinen P, Ukkonen E. Two algorithms for approximate string matching in static texts. In: Proceedings of the 16th International Symposium on Mathematical Foundations of Computer Science. 1991, 240–248
|
[44] |
Li C, Wang B, Yang X. VGRAM: improving performance of approximate queries on string collections using variable-length grams. In: Proceedings of the 33rd International Conference on Very Large Data Bases. 2007, 303–314
|
[45] |
Yang X, Wang B, Li C. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2008, 353–364
|
[46] |
Wang J, Li G, Deng D, Zhang Y, Feng J. Two birds with one stone: an efficient hierarchical framework for top-k and threshold-based string similarity search. In: Proceedings of International Conference on Data Engineering. 2015, 519–530
|
[47] |
Deng D, Li G, Feng J, Li W S. Top-k string similarity search with edit distance constraints. In: Proceedings of the 29th IEEE International Conference on Data Engineering. 2013, 925–936
|
[48] |
Wang X, Ding X, Tung A K H., Zhang Z. Efficient and effective KNN sequence search with approximate n-grams. Proceedings of the VLDB Endowment, 2013, 7(1): 1–12
|
[49] |
Fagin R, Lotem A, Naor M. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences, 2003, 66(4): 614–656
|
[50] |
Li C, Lu J, Lu Y. Efficient merging and filtering algorithms for approximate string searches. In: Proceedings of the 24th IEEE International Conference on Data Engineering. 2008, 257–266
|
[51] |
Siragusa E, Weese D, Reinert K. Scalable string similarity search/join with approximate seeds and multiple backtracking. In: Proceedings of EDBT/ICDT Joint Conference. 2013, 370–374
|
[52] |
Liu X, Li G, Feng J, Zhou L. Effective indices for efficient approximate string search and similarity join. In: Proceedings of the 9th IEEE International Conference on Web-Age Information Management. 2008, 127–134
|
[53] |
Cui J, Meng D, Chen Z. Leveraging deletion neighborhoods and trie for efficient string similarity search and join. Lecture Notes in Computer Science, 2014, 8870: 1–13
|
[54] |
Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In: Proceedings of International Conference on Very Large Data Bases. 2001, 491–500
|
[55] |
Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006, 5
|
[56] |
Qin J, Wang W, Lu Y, Xiao C, Lin X. Efficient exact edit similarity query processing with the asymmetric signature scheme. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2011, 1033–1044
|
[57] |
Rong C, Lu W, Wang X, Du X, Chen Y, Tung A K H. Efficient and scalable processing of string similarity join. IEEE Transactions on Knowl edge and Data Engineering, 2013, 25(10): 2217–2230
|
[58] |
Xiao C, Wang W, Lin X. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proceedings of the VLDB Endowment, 2008, 1(1): 933–944
|
[59] |
Xiao C, Wang W, Lin X, Yu J X. Efficient similarity joins for near duplicate detection. In: Proceedings of International World Wide Web Conference. 2008, 131–140
|
[60] |
Xiao C, Wang W, Lin X, Yu J X, Wang G. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems, 2011, 36(3): 15
|
[61] |
Wang W, Qin J, Xiao C, Lin X, Shen H T. VChunkJoin: an efficient algorithm for edit similarity joins. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(8): 1916–1929
|
[62] |
Qin J, Wang W, Xiao C, Lu Y, Lin X, Wang H. Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Transactions on Database Systems, 2013, 38(3): 16
|
[63] |
Wang J, Li G, Feng J. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2012, 85–96
|
[64] |
Deng D, Li G, Feng J. A pivotal prefix based filtering algorithm for string similarity search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2014, 673–684
|
[65] |
Li G, Deng D, Wang J, Feng J. PASS-JOIN: a partition-based method for similarity joins. Proceedings of the VLDB Endowment, 2011, 5(3): 253–264
|
[66] |
Li G, Deng D, Feng J. A partition-based method for string similarity joins with edit-distance constraints. ACM Transactions on Database Systems, 2013, 38(2): 9
|
[67] |
Li G, He J, Deng D, Li J. Efficient similarity join and search on multiattribute data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2015, 1137–1151
|
[68] |
Ciaccia P, Patella M, Zezula P. M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of International Conference on Very Large Databases. 1997, 426–435
|
[69] |
Aßfalg J, Borgwardt K M, Kriegel H P. 3D String: a feature string kernel for 3D object classification on voxelized data. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management. 2006, 198–207
|
[70] |
Bartolini I, Ciaccia P, Patella M. String matching with metric trees using an approximate distance. In: Proceedings of 9th International Symposium on String Processing and Information Retrieval. 2002, 271–283
|
[71] |
Wang J, Feng J, Li G. Trie-join: efficient trie-based string similarity joins with edit-distance constraints. Proceedings of the VLDB Endowment, 20103, (1–2): 1219–1230
|
[72] |
Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The International Journal on Very Large Data Bases, 2012, 21(4): 437–461
|
[73] |
Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases. 2006, 918–929
|
[74] |
Xiao C, Wang W, Lin X, Shang H. Top-k set similarity joins. In: Proceedings of the 25th IEEE International Conference on Data Engineering. 2009, 916–927
|
[75] |
Zhang Z, Hadjieleftheriou M, Ooi B C, Srivastava D. Bed-tree: an allpurpose index structure for string similarity search based on edit distance. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 915–926
|
[76] |
Deng D, Li G, Hao S, Wang J, Feng J. Massjoin: a MapReduce-based method for scalable string similarity joins. In: Proceedings of the 30th IEEE International Conference on Data Engineering. 2014, 340–351
|
[77] |
Afrati F N, Sarma A D, Menestrina D, Parameswaran A G, Ullman J D. Fuzzy joins using MapReduce. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 498–509
|
[78] |
Vernica R, Carey M J, Li C. Efficient parallel set-similarity joins using MapReduce. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 495–506
|
[79] |
Metwally A, Faloutsos C. V-SMART-join: a scalable MapReduce framework for all-pair similarity joins of multisets and vectors. Proceedings of the VLDB Endowment, 2012, 5(8): 704–715
|
[80] |
Deng D, Jiang Y, Li G, Li J, Yu C. Scalable column concept determination for web tables using large knowledge bases. Proceedings of the VLDB Endowment, 2013, 6(13): 1606–1617
|
[81] |
Li G, Wang J, Li C, Feng J. Supporting efficient top-k queries in typeahead search. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2012, 355–364
|
[82] |
Li G, Ji S, Li C, Feng J. Efficient fuzzy full-text type-ahead search. The International Journal on Very Large Data Bases, 2011, 20(4): 617–640
|
[83] |
Xiao C, Qin J, Wang W, Ishikawa Y, Tsuda K, Sadakane K. Efficient error-tolerant query autocompletion. Proceedings of the VLDB Endowment, 2013, 6(6): 373–384
|
[84] |
Li G, Feng J, Li C. Supporting search-as-you-type using SQL in databases. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(2): 461–475
|
[85] |
Li G, Deng D, Feng J. Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2011, 529–540
|
[86] |
Li G, Hu J, Feng J, Tan K. Effective location identification from microblogs. In: Proceedings of the 30th IEEE International Conference on Data Engineering. 2014, 880–891
|
[87] |
Ukkonen E. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 1992, 92(1): 191–211
|
[88] |
Navarro G, Baeza-Yates R A, Sutinen E, Tarhio J. Indexing Methods for Approximate String Matching. IEEE Data Engineering Bulletin, 2001, 24(4): 19–27
|
[89] |
Jiang Y, Li G, Feng J, Li W. String similarity joins: an experimental evaluation. Proceedings of the VLDB Endowment, 2014, 7(8): 625–636
|
[90] |
Jiang Y, Deng D, Wang J, Li G, Feng J. Efficient parallel partitionbased algorithms for similarity search and join with edit distance constraints. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops. 2013, 341–348
|
[91] |
Wandelt S, Deng D, Gerdjikov S, Mishra S, Mitankin P, Patil M, Siragusa E, Tiskin A,Wang W, Wang J, Leser U. State-of-the-art in string similarity search and join. ACM SIGMOD Record, 2014, 43(1): 64–76
|
/
〈 | 〉 |