Active transfer learning of matching query results across multiple sources

Jie XIN, Zhiming CUI, Pengpeng ZHAO, Tianxu HE

Front. Comput. Sci., 2015, Vol. 9, Issue 4: 595-607. DOI: 10.1007/s11704-015-4068-3
RESEARCH ARTICLE

Abstract

Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Many algorithmic approaches have been developed, with supervised learning offering superior performance on most tasks. However, the prohibitive cost of labeling training data remains a major obstacle to detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data with missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share learned common structures of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost so that it is linear in the number of sources, their random sampling strategy does not adequately handle the common sample imbalance problem. In this paper, we present a novel multi-source active transfer learning framework that jointly selects fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the learned classifiers can be both label-economical and flexible, even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm achieves impressive performance with far fewer labeled samples for record matching with numerous and varied sources.

Keywords

entity resolution / active learning / transfer learning / convex optimization

Cite this article

Jie XIN, Zhiming CUI, Pengpeng ZHAO, Tianxu HE. Active transfer learning of matching query results across multiple sources. Front. Comput. Sci., 2015, 9(4): 595‒607 https://doi.org/10.1007/s11704-015-4068-3

RIGHTS & PERMISSIONS

2014 Higher Education Press and Springer-Verlag Berlin Heidelberg