A genetic algorithm based entity resolution approach with active learning

Chenchen SUN, Derong SHEN, Yue KOU, Tiezheng NIE, Ge YU

PDF(451 KB)
PDF(451 KB)
Front. Comput. Sci. ›› 2017, Vol. 11 ›› Issue (1) : 147-159. DOI: 10.1007/s11704-015-5276-6
RESEARCH ARTICLE

A genetic algorithm based entity resolution approach with active learning

Author information +
History +

Abstract

Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based entity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes’ comparisons with proper thresholds. We use active learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity resolution approaches in accuracy.

Keywords

entity resolution / genetic algorithm / active learning / data quality / data integration

Cite this article

Download citation ▾
Chenchen SUN, Derong SHEN, Yue KOU, Tiezheng NIE, Ge YU. A genetic algorithm based entity resolution approach with active learning. Front. Comput. Sci., 2017, 11(1): 147‒159 https://doi.org/10.1007/s11704-015-5276-6

References

[1]
Chen J C, Chen Y G, Du X Y, Li C P, Lu J H, Zhao S Y, Zhou X. Big data challenge: a data management perspective. Frontiers of Computer Science, 2013, 7(2): 157–164
CrossRef Google scholar
[2]
Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183–1210
CrossRef Google scholar
[3]
Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16
CrossRef Google scholar
[4]
Kopcke H, Rahm E. Frameworks for entity matching: a comparison. Data and Knowledge Engineering, 2010, 69(2): 197–210
CrossRef Google scholar
[5]
Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278
CrossRef Google scholar
[6]
Monge A E, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 1996, 267–270
[7]
Pinheiro J C, Sun D X. Methods for linking and mining massive heterogeneous databases. In: Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1998, 309–313
[8]
Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003, 39–48
CrossRef Google scholar
[9]
Minton S N, Nanjo C, Knoblock C A, Michalowski M, Michelson M. A heterogeneous field matching method for record linkage. In: 5th IEEE International Conference on Data Mining. 2005, 8
CrossRef Google scholar
[10]
Sun C, Shen D, Kou Y, Nie T, Yu G. ERGP: A Combined Entity Resolution Approach with Genetic Programming. In: Proceedings of the 11th IEEE Web Information System and Application Conference. 2014, 215–220
CrossRef Google scholar
[11]
Li P, Dong X L, Maurino A, Srivastava D. Linking temporal records. Frontiers of Computer Science, 2012, 6(3): 293–312
CrossRef Google scholar
[12]
Sun C C, Shen D R, Kou Y, Nie T Z, Yu G. GB-JER: A Graph-Based Model for Joint Entity Resolution. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications. 2015, 458–473
[13]
Tejada S, Knoblock C A, Minton S. Learning object identification rules for information integration. Information Systems, 2001, 26(8): 607–633
CrossRef Google scholar
[14]
Winkler W E. Methods for record linkage and bayesian networks. Technical report, Series RRS2002/05. 2002
[15]
De Carvalho M G, Gonçalves M A, Laender A H F, Da Silva A S. Learning to deduplicate. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. 2006, 41–50
CrossRef Google scholar
[16]
De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. Replica identification using genetic programming. In: Proceedings of the 2008 ACM Symposium on Applied Computing. 2008, 1801–1806
CrossRef Google scholar
[17]
De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 399–412
CrossRef Google scholar
[18]
Isele R, Bizer C. Learning linkage rules using genetic programming. In: Proceedings of the 6th International Workshop on Ontology Matching. 2011, 13–24
[19]
Isele R, Bizer C. Learning expressive linkage rules using genetic programming. The Very Large Databases Endowment, 2012, 5(11): 1638–1649
CrossRef Google scholar
[20]
Banzhaf W, Nordin P, Keller R E, Francone F D. Genetic programming: an introduction. San Francisco: Morgan Kaufmann, 1998
CrossRef Google scholar
[21]
Poli R, Langdon W B, McPhee N F, Koza J R. A field guide to genetic programming. Lulu. com, 2008
[22]
Liere R, Tadepalli P. Active learning with committees for text categorization. In: Proceedings of the 14th National Conference on Artificial Intelligence. 1997, 591–596
[23]
Cohn D A, Atlas L, R Ladner. Improving generalization with active learning. Machine Learning, 1994, 15(2): 201–221
CrossRef Google scholar
[24]
Bellare K, Suresh I, Parameswaran A, Rastogi V. Active sampling for entity matching with guarantees. ACM Transactions on Knowledge Discovery from Data, 2013, 7(3): 12
CrossRef Google scholar
[25]
Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 783–794
CrossRef Google scholar
[26]
Koza J R. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Boston: MIT press, 1992
[27]
Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In: Proceedings of the 9th ACM SIGKDDWorkshop on Data Cleaning and Object Consolidation. 2003, 73–78
[28]
Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM press, 1999
[29]
Blickle T, Thiele L. A Comparison of Selection Schemes Used in Genetic Algorithms. TIK-ReportNo.11, 1995
[30]
Monge A E, Elkan C P. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the 2nd ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery. 1997, 23–29
[31]
Shannon C E. A mathematical theory of communication. Bell System Technical Journal, 1948, 27(3): 379–423
CrossRef Google scholar
[32]
Hassanzadeh O, Chiang F, Lee H C, Miller R J. Framework for evaluating clustering algorithms in duplicate detection. The Very Large Databases Endowment, 2009, 2(1): 1282–1293
CrossRef Google scholar

RIGHTS & PERMISSIONS

2016 Higher Education Press and Springer-Verlag Berlin Heidelberg
AI Summary AI Mindmap
PDF(451 KB)

Accesses

Citations

Detail

Sections
Recommended

/