FrepJoin: an efficient partition-based algorithm for edit similarity join

Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI

PDF(401 KB)
PDF(401 KB)
Front. Inform. Technol. Electron. Eng ›› 2017, Vol. 18 ›› Issue (10) : 1499-1510. DOI: 10.1631/FITEE.1601347
Article
Article

FrepJoin: an efficient partition-based algorithm for edit similarity join

Author information +
History +

Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Keywords

String similarity join / Edit distance / Filter and refine / Data partition / Combined frequency vectors

Cite this article

Download citation ▾
Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI. FrepJoin: an efficient partition-based algorithm for edit similarity join. Front. Inform. Technol. Electron. Eng, 2017, 18(10): 1499‒1510 https://doi.org/10.1631/FITEE.1601347

References

[1]
Afrati , F.N., Sarma , A.D., Menestrina , D., , 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498–509. https://doi.org/10.1109/ICDE.2012.66
[2]
Arasu , A., Ganti , V., Kaushik , R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918–929.
[3]
Bayardo , R.J., Ma , Y., Srikant , R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131–140. https://doi.org/10.1145/1242572.1242591
[4]
Chaudhuri , S., Ganjam , K., Ganti , V., , 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313–324. https://doi.org/10.1145/872757.872796
[5]
Chaudhuri , S., Ganti , V., Kaushik , R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull. , 29(2):60–66.
[6]
Chaudhuri , S., Ganti , V., Kaushik , R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687–698. https://doi.org/10.1109/ICDE.2006.9
[7]
Dong , X., Halevy , A.Y., Yu , C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687–698.
[8]
Feng , J., Wang , J., Li , G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. , 21(4):437–461. https://doi.org/10.1007/s00778-011-0252-8
[9]
Ge , T., Li , Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow. , 4(11):772–782.
[10]
Gravano , L., Ipeirotis , P.G., Jagadish , H.V., , 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491–500.
[11]
Hadjieleftheriou , M., Srivastava , D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.
[12]
Henzinger , M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284–291. https://doi.org/10.1145/1148170.1148222
[13]
Ji , S., Li , G., Li , C., , 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371–380. https://doi.org/10.1145/1526709.1526760
[14]
Li , C., Lu , J., Lu , Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257–266. https://doi.org/10.1109/ICDE.2008.4497434
[15]
Li , G., Deng , D., Wang , J., , 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow. , 5(3):253–264. https://doi.org/10.14778/2078331.2078340
[16]
Metwally , A., Agrawal , D., Abbadi , A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241–250. https://doi.org/10.1145/1242572.1242606
[17]
Navarro , G., Salmela , L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214–221. https://doi.org/10.1007/978-3-642-03784-9_21
[18]
Qin , J., Wang , W., Xiao , C., , 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng. , 25(8):1916–1929. https://doi.org/10.1109/TKDE.2012.79
[19]
Sarawagi , S., Kirpal , A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743–754. https://doi.org/10.1145/1007568.1007652
[20]
Scott , D.W., 1979. On optimal and data-based histograms. Biometrika, 66:605–610. https://doi.org/10.1093/biomet/66.3.605
[21]
Vernica , R., Carey , M.J., Li , C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495–506. https://doi.org/10.1145/1807167.1807222
[22]
Wang , J., Li , G., Feng , J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458–469. https://doi.org/10.1109/ICDE.2011.5767865
[23]
Wang , W., Xiao , C., Lin , X., , 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759–770. https://doi.org/10.1145/1559845.1559925
[24]
Xiao , C., Wang , W., Lin , X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. , 1(1):933–944. https://doi.org/10.14778/1453856.1453957
[25]
Xiao , C., Wang , W., Lin , X., , 2008b. Efficient similarity joins for near duplicate detection. Int. World Wide Web Conf., p.131–140. https://doi.org/10.1145/2000824.2000825

RIGHTS & PERMISSIONS

2017 Zhejiang University and Springer-Verlag GmbH Germany
PDF(401 KB)

Accesses

Citations

Detail

Sections
Recommended

/