Near-duplicate document detection with improved similarity measurement
Xin-pan Yuan, Jun Long, Zu-ping Zhang, Wei-hua Gui
Journal of Central South University, 2012, Vol. 19, Issue (8): 2231-2237.
To quickly find documents with high similarity in existing document sets, a fingerprint group merging retrieval algorithm is proposed. It addresses two sides of the problem: the given similarity threshold cannot be set too low, and using fewer fingerprints leads to low accuracy. It can be proved that the fingerprint group merging retrieval algorithm improves the efficiency of similarity retrieval at lower similarity thresholds. Experiments with a lower similarity threshold r=0.7 and a high number of fingerprint bits k=400 show that the CPU time cost decreases from 1 921 s to 273 s. Theoretical analysis and experimental results verify the effectiveness of this method.
similarity estimation / near-duplicate document detection / fingerprint group / Hamming distance / minwise hashing
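For readers unfamiliar with the keywords, the following is a minimal sketch of plain minwise hashing, in which document similarity is estimated from the Hamming distance between two fingerprint vectors. It is not the fingerprint group merging retrieval algorithm proposed in the paper, whose details are not given in this abstract. The function names, the word-shingle size, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration; k here is the number of hash functions, which may not be the same quantity as the abstract's k=400.

```python
import hashlib


def shingles(text, n=3):
    """Split text into a set of overlapping n-word shingles (n is an assumed choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_fingerprint(items, k=400):
    """Return k min-hash values for a non-empty set of strings.

    Each of the k hash functions is simulated by salting BLAKE2b with
    the function index (an illustrative stand-in for a hash family).
    """
    fingerprint = []
    for seed in range(k):
        salt = seed.to_bytes(4, "big")
        fingerprint.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return fingerprint


def hamming_distance(fp_a, fp_b):
    """Count the positions at which two equal-length fingerprints differ."""
    return sum(x != y for x, y in zip(fp_a, fp_b))


def estimated_similarity(fp_a, fp_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    return 1.0 - hamming_distance(fp_a, fp_b) / len(fp_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    return len(a & b) / len(a | b)
```

Under these assumptions, two documents would be flagged as near-duplicates when `estimated_similarity` reaches a chosen threshold such as the r=0.7 mentioned in the abstract. Comparing every pair of fingerprints this way is the brute-force baseline; the retrieval speed-up reported in the abstract comes from the paper's grouping method, which this sketch does not attempt to reproduce.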