A genetic algorithm based entity resolution approach with active learning

Chenchen SUN; Derong SHEN; Yue KOU; Tiezheng NIE; Ge YU

doi:10.1007/s11704-015-5276-6

Front. Comput. Sci. ›› 2017, Vol. 11 ›› Issue (1) :147 -159. DOI: 10.1007/s11704-015-5276-6

RESEARCH ARTICLE

A genetic algorithm based entity resolution approach with active learning

Author information +

History +

PDF (417KB)

Abstract

Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based entity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes’ comparisons with proper thresholds. We use active learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity resolution approaches in accuracy.

Keywords

entity resolution / genetic algorithm / active learning / data quality / data integration

Cite this article

Download citation ▾

Chenchen SUN, Derong SHEN, Yue KOU, Tiezheng NIE, Ge YU. A genetic algorithm based entity resolution approach with active learning. Front. Comput. Sci., 2017, 11(1): 147-159 DOI:10.1007/s11704-015-5276-6

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Chen J C, Chen Y G, Du X Y, Li C P, Lu J H, Zhao S Y, Zhou X. Big data challenge: a data management perspective. Frontiers of Computer Science, 2013, 7(2): 157–164

[2]	Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183–1210

[3]	Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16

[4]	Kopcke H, Rahm E. Frameworks for entity matching: a comparison. Data and Knowledge Engineering, 2010, 69(2): 197–210

[5]	Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278

[6]	Monge A E, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 1996, 267–270

[7]	Pinheiro J C, Sun D X. Methods for linking and mining massive heterogeneous databases. In: Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1998, 309–313

[8]	Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003, 39–48

[9]	Minton S N, Nanjo C, Knoblock C A, Michalowski M, Michelson M. A heterogeneous field matching method for record linkage. In: 5th IEEE International Conference on Data Mining. 2005, 8

[10]	Sun C, Shen D, Kou Y, Nie T, Yu G. ERGP: A Combined Entity Resolution Approach with Genetic Programming. In: Proceedings of the 11th IEEE Web Information System and Application Conference. 2014, 215–220

[11]	Li P, Dong X L, Maurino A, Srivastava D. Linking temporal records. Frontiers of Computer Science, 2012, 6(3): 293–312

[12]	Sun C C, Shen D R, Kou Y, Nie T Z, Yu G. GB-JER: A Graph-Based Model for Joint Entity Resolution. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications. 2015, 458–473

[13]	Tejada S, Knoblock C A, Minton S. Learning object identification rules for information integration. Information Systems, 2001, 26(8): 607–633

[14]	Winkler W E. Methods for record linkage and bayesian networks. Technical report, Series RRS2002/05. 2002

[15]	De Carvalho M G, Gonçalves M A, Laender A H F, Da Silva A S. Learning to deduplicate. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. 2006, 41–50

[16]	De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. Replica identification using genetic programming. In: Proceedings of the 2008 ACM Symposium on Applied Computing. 2008, 1801–1806

[17]	De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 399–412

[18]	Isele R, Bizer C. Learning linkage rules using genetic programming. In: Proceedings of the 6th International Workshop on Ontology Matching. 2011, 13–24

[19]	Isele R, Bizer C. Learning expressive linkage rules using genetic programming. The Very Large Databases Endowment, 2012, 5(11): 1638–1649

[20]	Banzhaf W, Nordin P, Keller R E, Francone F D. Genetic programming: an introduction. San Francisco: Morgan Kaufmann, 1998

[21]	Poli R, Langdon W B, McPhee N F, Koza J R. A field guide to genetic programming. Lulu. com, 2008

[22]	Liere R, Tadepalli P. Active learning with committees for text categorization. In: Proceedings of the 14th National Conference on Artificial Intelligence. 1997, 591–596

[23]	Cohn D A, Atlas L, R Ladner. Improving generalization with active learning. Machine Learning, 1994, 15(2): 201–221

[24]	Bellare K, Suresh I, Parameswaran A, Rastogi V. Active sampling for entity matching with guarantees. ACM Transactions on Knowledge Discovery from Data, 2013, 7(3): 12

[25]	Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 783–794

[26]	Koza J R. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Boston: MIT press, 1992

[27]	Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In: Proceedings of the 9th ACM SIGKDDWorkshop on Data Cleaning and Object Consolidation. 2003, 73–78

[28]	Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM press, 1999

[29]	Blickle T, Thiele L. A Comparison of Selection Schemes Used in Genetic Algorithms. TIK-ReportNo.11, 1995

[30]	Monge A E, Elkan C P. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the 2nd ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery. 1997, 23–29

[31]	Shannon C E. A mathematical theory of communication. Bell System Technical Journal, 1948, 27(3): 379–423

[32]	Hassanzadeh O, Chiang F, Lee H C, Miller R J. Framework for evaluating clustering algorithms in duplicate detection. The Very Large Databases Endowment, 2009, 2(1): 1282–1293