FrepJoin: an efficient partition-based algorithm for edit similarity join
Ji-zhou LUO , Sheng-fei SHI , Hong-zhi WANG , Jian-zhong LI
Front. Inform. Technol. Electron. Eng ›› 2017, Vol. 18 ›› Issue (10) : 1499 -1510.
FrepJoin: an efficient partition-based algorithm for edit similarity join
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
String similarity join / Edit distance / Filter and refine / Data partition / Combined frequency vectors
| [1] |
|
| [2] |
|
| [3] |
|
| [4] |
|
| [5] |
|
| [6] |
|
| [7] |
|
| [8] |
|
| [9] |
|
| [10] |
|
| [11] |
|
| [12] |
|
| [13] |
|
| [14] |
|
| [15] |
|
| [16] |
|
| [17] |
|
| [18] |
|
| [19] |
|
| [20] |
|
| [21] |
|
| [22] |
|
| [23] |
|
| [24] |
|
| [25] |
|
Zhejiang University and Springer-Verlag GmbH Germany
/
| 〈 |
|
〉 |