FrepJoin: an efficient partition-based algorithm for edit similarity join

Ji-zhou LUO , Sheng-fei SHI , Hong-zhi WANG , Jian-zhong LI

Front. Inform. Technol. Electron. Eng ›› 2017, Vol. 18 ›› Issue (10) : 1499 -1510.

PDF (401KB)
Front. Inform. Technol. Electron. Eng ›› 2017, Vol. 18 ›› Issue (10) : 1499 -1510. DOI: 10.1631/FITEE.1601347
Article
Article

FrepJoin: an efficient partition-based algorithm for edit similarity join

Author information +
History +
PDF (401KB)

Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Keywords

String similarity join / Edit distance / Filter and refine / Data partition / Combined frequency vectors

Cite this article

Download citation ▾
Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI. FrepJoin: an efficient partition-based algorithm for edit similarity join. Front. Inform. Technol. Electron. Eng, 2017, 18(10): 1499-1510 DOI:10.1631/FITEE.1601347

登录浏览全文

4963

注册一个新账户 忘记密码

References

[1]

Afrati , F.N., Sarma , A.D., Menestrina , D., , 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498–509.

[2]

Arasu , A., Ganti , V., Kaushik , R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918–929.

[3]

Bayardo , R.J., Ma , Y., Srikant , R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131–140.

[4]

Chaudhuri , S., Ganjam , K., Ganti , V., , 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313–324.

[5]

Chaudhuri , S., Ganti , V., Kaushik , R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull. , 29(2):60–66.

[6]

Chaudhuri , S., Ganti , V., Kaushik , R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687–698.

[7]

Dong , X., Halevy , A.Y., Yu , C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687–698.

[8]

Feng , J., Wang , J., Li , G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. , 21(4):437–461.

[9]

Ge , T., Li , Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow. , 4(11):772–782.

[10]

Gravano , L., Ipeirotis , P.G., Jagadish , H.V., , 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491–500.

[11]

Hadjieleftheriou , M., Srivastava , D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.

[12]

Henzinger , M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284–291.

[13]

Ji , S., Li , G., Li , C., , 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371–380.

[14]

Li , C., Lu , J., Lu , Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257–266.

[15]

Li , G., Deng , D., Wang , J., , 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow. , 5(3):253–264.

[16]

Metwally , A., Agrawal , D., Abbadi , A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241–250.

[17]

Navarro , G., Salmela , L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214–221.

[18]

Qin , J., Wang , W., Xiao , C., , 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng. , 25(8):1916–1929.

[19]

Sarawagi , S., Kirpal , A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743–754.

[20]

Scott , D.W., 1979. On optimal and data-based histograms. Biometrika, 66:605–610.

[21]

Vernica , R., Carey , M.J., Li , C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495–506.

[22]

Wang , J., Li , G., Feng , J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458–469.

[23]

Wang , W., Xiao , C., Lin , X., , 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759–770.

[24]

Xiao , C., Wang , W., Lin , X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. , 1(1):933–944.

[25]

Xiao , C., Wang , W., Lin , X., , 2008b. Efficient similarity joins for near duplicate detection. Int. World Wide Web Conf., p.131–140.

RIGHTS & PERMISSIONS

Zhejiang University and Springer-Verlag GmbH Germany

AI Summary AI Mindmap
PDF (401KB)

Supplementary files

FITEE-1499-17005-JZL_suppl_1

FITEE-1499-17005-JZL_suppl_2

2429

Accesses

0

Citation

Detail

Sections
Recommended

AI思维导图

/