FrepJoin: an efficient partition-based algorithm for edit similarity join

Ji-zhou LUO; Sheng-fei SHI; Hong-zhi WANG; Jian-zhong LI

doi:10.1631/FITEE.1601347

PDF(401 KB)

Front. Inform. Technol. Electron. Eng ›› 2017, Vol. 18 ›› Issue (10) : 1499-1510. DOI: 10.1631/FITEE.1601347

Article

FrepJoin: an efficient partition-based algorithm for edit similarity join

Author information +

History +

Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Keywords

Cite this article

EndNote

Ris (Procite)

Bibtex

Download citation ▾

Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI. FrepJoin: an efficient partition-based algorithm for edit similarity join. Front. Inform. Technol. Electron. Eng, 2017, 18(10): 1499‒1510 https://doi.org/10.1631/FITEE.1601347

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Afrati , F.N., Sarma , A.D., Menestrina , D., , 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498–509. https://doi.org/10.1109/ICDE.2012.66

[2]	Arasu , A., Ganti , V., Kaushik , R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918–929.

[3]	Bayardo , R.J., Ma , Y., Srikant , R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131–140. https://doi.org/10.1145/1242572.1242591

[4]	Chaudhuri , S., Ganjam , K., Ganti , V., , 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313–324. https://doi.org/10.1145/872757.872796

[5]	Chaudhuri , S., Ganti , V., Kaushik , R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull. , 29(2):60–66.

[6]	Chaudhuri , S., Ganti , V., Kaushik , R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687–698. https://doi.org/10.1109/ICDE.2006.9

[7]	Dong , X., Halevy , A.Y., Yu , C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687–698.

[8]	Feng , J., Wang , J., Li , G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. , 21(4):437–461. https://doi.org/10.1007/s00778-011-0252-8

[9]	Ge , T., Li , Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow. , 4(11):772–782.

[10]	Gravano , L., Ipeirotis , P.G., Jagadish , H.V., , 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491–500.

[11]	Hadjieleftheriou , M., Srivastava , D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.

[12]	Henzinger , M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284–291. https://doi.org/10.1145/1148170.1148222

[13]	Ji , S., Li , G., Li , C., , 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371–380. https://doi.org/10.1145/1526709.1526760

[14]	Li , C., Lu , J., Lu , Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257–266. https://doi.org/10.1109/ICDE.2008.4497434

[15]	Li , G., Deng , D., Wang , J., , 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow. , 5(3):253–264. https://doi.org/10.14778/2078331.2078340

[16]	Metwally , A., Agrawal , D., Abbadi , A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241–250. https://doi.org/10.1145/1242572.1242606

[17]	Navarro , G., Salmela , L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214–221. https://doi.org/10.1007/978-3-642-03784-9_21

[18]	Qin , J., Wang , W., Xiao , C., , 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng. , 25(8):1916–1929. https://doi.org/10.1109/TKDE.2012.79

[19]	Sarawagi , S., Kirpal , A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743–754. https://doi.org/10.1145/1007568.1007652

[20]	Scott , D.W., 1979. On optimal and data-based histograms. Biometrika, 66:605–610. https://doi.org/10.1093/biomet/66.3.605

[21]	Vernica , R., Carey , M.J., Li , C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495–506. https://doi.org/10.1145/1807167.1807222

[22]	Wang , J., Li , G., Feng , J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458–469. https://doi.org/10.1109/ICDE.2011.5767865

[23]	Wang , W., Xiao , C., Lin , X., , 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759–770. https://doi.org/10.1145/1559845.1559925

[24]	Xiao , C., Wang , W., Lin , X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. , 1(1):933–944. https://doi.org/10.14778/1453856.1453957

[25]	Xiao , C., Wang , W., Lin , X., , 2008b. Efficient similarity joins for near duplicate detection. Int. World Wide Web Conf., p.131–140. https://doi.org/10.1145/2000824.2000825