A genetic algorithm based entity resolution approach with active learning
Chenchen SUN, Derong SHEN, Yue KOU, Tiezheng NIE, Ge YU
A genetic algorithm based entity resolution approach with active learning
Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based entity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes’ comparisons with proper thresholds. We use active learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity resolution approaches in accuracy.
entity resolution / genetic algorithm / active learning / data quality / data integration
[1] |
Chen J C, Chen Y G, Du X Y, Li C P, Lu J H, Zhao S Y, Zhou X. Big data challenge: a data management perspective. Frontiers of Computer Science, 2013, 7(2): 157–164
CrossRef
Google scholar
|
[2] |
Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183–1210
CrossRef
Google scholar
|
[3] |
Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16
CrossRef
Google scholar
|
[4] |
Kopcke H, Rahm E. Frameworks for entity matching: a comparison. Data and Knowledge Engineering, 2010, 69(2): 197–210
CrossRef
Google scholar
|
[5] |
Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278
CrossRef
Google scholar
|
[6] |
Monge A E, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 1996, 267–270
|
[7] |
Pinheiro J C, Sun D X. Methods for linking and mining massive heterogeneous databases. In: Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1998, 309–313
|
[8] |
Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003, 39–48
CrossRef
Google scholar
|
[9] |
Minton S N, Nanjo C, Knoblock C A, Michalowski M, Michelson M. A heterogeneous field matching method for record linkage. In: 5th IEEE International Conference on Data Mining. 2005, 8
CrossRef
Google scholar
|
[10] |
Sun C, Shen D, Kou Y, Nie T, Yu G. ERGP: A Combined Entity Resolution Approach with Genetic Programming. In: Proceedings of the 11th IEEE Web Information System and Application Conference. 2014, 215–220
CrossRef
Google scholar
|
[11] |
Li P, Dong X L, Maurino A, Srivastava D. Linking temporal records. Frontiers of Computer Science, 2012, 6(3): 293–312
CrossRef
Google scholar
|
[12] |
Sun C C, Shen D R, Kou Y, Nie T Z, Yu G. GB-JER: A Graph-Based Model for Joint Entity Resolution. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications. 2015, 458–473
|
[13] |
Tejada S, Knoblock C A, Minton S. Learning object identification rules for information integration. Information Systems, 2001, 26(8): 607–633
CrossRef
Google scholar
|
[14] |
Winkler W E. Methods for record linkage and bayesian networks. Technical report, Series RRS2002/05. 2002
|
[15] |
De Carvalho M G, Gonçalves M A, Laender A H F, Da Silva A S. Learning to deduplicate. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. 2006, 41–50
CrossRef
Google scholar
|
[16] |
De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. Replica identification using genetic programming. In: Proceedings of the 2008 ACM Symposium on Applied Computing. 2008, 1801–1806
CrossRef
Google scholar
|
[17] |
De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 399–412
CrossRef
Google scholar
|
[18] |
Isele R, Bizer C. Learning linkage rules using genetic programming. In: Proceedings of the 6th International Workshop on Ontology Matching. 2011, 13–24
|
[19] |
Isele R, Bizer C. Learning expressive linkage rules using genetic programming. The Very Large Databases Endowment, 2012, 5(11): 1638–1649
CrossRef
Google scholar
|
[20] |
Banzhaf W, Nordin P, Keller R E, Francone F D. Genetic programming: an introduction. San Francisco: Morgan Kaufmann, 1998
CrossRef
Google scholar
|
[21] |
Poli R, Langdon W B, McPhee N F, Koza J R. A field guide to genetic programming. Lulu. com, 2008
|
[22] |
Liere R, Tadepalli P. Active learning with committees for text categorization. In: Proceedings of the 14th National Conference on Artificial Intelligence. 1997, 591–596
|
[23] |
Cohn D A, Atlas L, R Ladner. Improving generalization with active learning. Machine Learning, 1994, 15(2): 201–221
CrossRef
Google scholar
|
[24] |
Bellare K, Suresh I, Parameswaran A, Rastogi V. Active sampling for entity matching with guarantees. ACM Transactions on Knowledge Discovery from Data, 2013, 7(3): 12
CrossRef
Google scholar
|
[25] |
Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 783–794
CrossRef
Google scholar
|
[26] |
Koza J R. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Boston: MIT press, 1992
|
[27] |
Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In: Proceedings of the 9th ACM SIGKDDWorkshop on Data Cleaning and Object Consolidation. 2003, 73–78
|
[28] |
Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM press, 1999
|
[29] |
Blickle T, Thiele L. A Comparison of Selection Schemes Used in Genetic Algorithms. TIK-ReportNo.11, 1995
|
[30] |
Monge A E, Elkan C P. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the 2nd ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery. 1997, 23–29
|
[31] |
Shannon C E. A mathematical theory of communication. Bell System Technical Journal, 1948, 27(3): 379–423
CrossRef
Google scholar
|
[32] |
Hassanzadeh O, Chiang F, Lee H C, Miller R J. Framework for evaluating clustering algorithms in duplicate detection. The Very Large Databases Endowment, 2009, 2(1): 1282–1293
CrossRef
Google scholar
|
/
〈 | 〉 |