Near-duplicate document detection with improved similarity measurement

Xin-pan Yuan; Jun Long; Zu-ping Zhang; Wei-hua Gui

doi:10.1007/s11771-012-1267-z

Journal of Central South University ›› 2012, Vol. 19 ›› Issue (8) :2231 -2237. DOI: 10.1007/s11771-012-1267-z

Article

Near-duplicate document detection with improved similarity measurement

Author information +

History +

PDF

Abstract

To quickly find documents with high similarity in existing documentation sets, fingerprint group merging retrieval algorithm is proposed to address both sides of the problem: a given similarity threshold could not be too low and fewer fingerprints could lead to low accuracy. It can be proved that the efficiency of similarity retrieval is improved by fingerprint group merging retrieval algorithm with lower similarity threshold. Experiments with the lower similarity threshold r=0.7 and high fingerprint bits k=400 demonstrate that the CPU time-consuming cost decreases from 1 921 s to 273 s. Theoretical analysis and experimental results verify the effectiveness of this method.

Keywords

Cite this article

Download citation ▾

Xin-pan Yuan, Jun Long, Zu-ping Zhang, Wei-hua Gui. Near-duplicate document detection with improved similarity measurement. Journal of Central South University, 2012, 19 (8) : 2231-2237 DOI:10.1007/s11771-012-1267-z

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	BroderA. Z., CharikarM., FriezeA. M., MitzenmacherM.. Min-wise independent permutations [J]. Journal of Computer Systems and Sciences, 2000, 60(3): 630-659

[2]	BroderA. Z.. Identifying and filtering near-duplicate documents [C]. Proceeding COM ′00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, 2000LondonSpringer-Verlag1-10

[3]	BroderA. Z.. On the resemblance and containment of documents [C]. Proceedings of Compression and Complexity of Sequences, 1997Washington, DC, USAIEEE Computer Society21-29

[4]	KalpakisK., TangS.. Collaborative data gathering in wireless sensor networks using measurement co-occurrence [J]. Computer Communications, 2008, 31(10): 1979-1992

[5]	DourisboureY., GeraciF., PellegriniM.. Extraction and classification of dense implicit communities in the web graph [J]. ACM Transactions on the Web (TWEB), 2009, 3(2): 1-36

[6]	BenderskyM., CroftW. B.. Finding text reuse on the web [C]. WSDM ′09 Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009New York, USAACM262-271

[7]	BuehrerG., ChellapillaK.. A scalable pattern mining approach to web graph compression with communities [C]. WSDM ′08 Proceedings of the International Conference on Web Search and Web Data Mining, 2008New York, USAACM95-106

[8]	IndykP.. A small approximately min-wise independent family of hash functions [J]. Journal of Algorithm, 2001, 38(1): 84-90

[9]	CharikarM. S.. Similarity estimation techniques from rounding algorithms [C]. STOC ′02 Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, 2002New York, USAACM380-388

[10]	LiP., KÖNIGA. C.. b-Bit minwise hashing [C]. www ′10 Proceedings of the 19th International Conference on World Wide Web, 2010New York, USAACM671-680

[11]	LiP., KÖNIGA. C.. Theory and applications of b-Bit minwise hashing [J]. Communications of the ACM, 2011, 54(8): 101-109

[12]	LiP., MooreJ., KÖNIGA. C.. Hashing algorithms for large-scale learning [C]. Neural Information Processing Systems (NIPS) BC, 2011CanadaVancouver35-44

[13]	BachrachY., HerbrichR.. Fingerprinting ratings for collaborative filtering theoretical and empirical analysis [C]. SPIRE ′10 Proceedings of the 17th International Conference on String Processing and Information Retrieval, 2010Berlin, HeidelbergSpringer-Verlag25-36

[14]	FeigenblatG., PoratE., ShiftanA.. Exponential time improvement for min-wise based algorithms [C]. SODA ′11 Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, 2011USASIAM57-66

[15]	BachrachY., HerbrichR., PoratE.. Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems [C]. SPIRE ′09: Proceedings of the 16th International Symposium on String Processing and Information Retrieval, 2009London, UKSpringer-Verlag344-352

[16]	MankuG. S., JainA., SarmaA. D.. Detecting near-duplicates for web crawling [C]. www ′07 Proceedings of the 16th International Conference on World Wide Web, 2007New York, USAACM141-150