FrepJoin: an efficient partition-based algorithm for edit similarity join
Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI
FrepJoin: an efficient partition-based algorithm for edit similarity join
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
String similarity join / Edit distance / Filter and refine / Data partition / Combined frequency vectors
[1] |
Afrati , F.N., Sarma , A.D., Menestrina , D.,
|
[2] |
Arasu , A., Ganti , V., Kaushik , R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918–929.
|
[3] |
Bayardo , R.J., Ma , Y., Srikant , R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131–140. https://doi.org/10.1145/1242572.1242591
|
[4] |
Chaudhuri , S., Ganjam , K., Ganti , V.,
|
[5] |
Chaudhuri , S., Ganti , V., Kaushik , R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull. , 29(2):60–66.
|
[6] |
Chaudhuri , S., Ganti , V., Kaushik , R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687–698. https://doi.org/10.1109/ICDE.2006.9
|
[7] |
Dong , X., Halevy , A.Y., Yu , C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687–698.
|
[8] |
Feng , J., Wang , J., Li , G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. , 21(4):437–461. https://doi.org/10.1007/s00778-011-0252-8
|
[9] |
Ge , T., Li , Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow. , 4(11):772–782.
|
[10] |
Gravano , L., Ipeirotis , P.G., Jagadish , H.V.,
|
[11] |
Hadjieleftheriou , M., Srivastava , D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.
|
[12] |
Henzinger , M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284–291. https://doi.org/10.1145/1148170.1148222
|
[13] |
Ji , S., Li , G., Li , C.,
|
[14] |
Li , C., Lu , J., Lu , Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257–266. https://doi.org/10.1109/ICDE.2008.4497434
|
[15] |
Li , G., Deng , D., Wang , J.,
|
[16] |
Metwally , A., Agrawal , D., Abbadi , A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241–250. https://doi.org/10.1145/1242572.1242606
|
[17] |
Navarro , G., Salmela , L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214–221. https://doi.org/10.1007/978-3-642-03784-9_21
|
[18] |
Qin , J., Wang , W., Xiao , C.,
|
[19] |
Sarawagi , S., Kirpal , A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743–754. https://doi.org/10.1145/1007568.1007652
|
[20] |
Scott , D.W., 1979. On optimal and data-based histograms. Biometrika, 66:605–610. https://doi.org/10.1093/biomet/66.3.605
|
[21] |
Vernica , R., Carey , M.J., Li , C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495–506. https://doi.org/10.1145/1807167.1807222
|
[22] |
Wang , J., Li , G., Feng , J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458–469. https://doi.org/10.1109/ICDE.2011.5767865
|
[23] |
Wang , W., Xiao , C., Lin , X.,
|
[24] |
Xiao , C., Wang , W., Lin , X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. , 1(1):933–944. https://doi.org/10.14778/1453856.1453957
|
[25] |
Xiao , C., Wang , W., Lin , X.,
|
/
〈 | 〉 |