MapReduce-based entity matching with multiple blocking functions

Cheqing JIN; Jie CHEN; Huiping LIU

doi:10.1007/s11704-016-5346-4

Front. Comput. Sci., 2017, Vol. 11, Issue 5: 895-911. DOI: 10.1007/s11704-016-5346-4

RESEARCH ARTICLE

Abstract

Entity matching, which aims to find records that refer to the same real-world object, has been studied for decades. To avoid verifying every pair of records in a massive data set, a common approach, known as the blocking-based method, selects a small proportion of record pairs for verification at a cost far lower than O(n²), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical, since many more matching records can be found this way, which significantly improves the quality of the query result.
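As an illustration of the blocking idea described above, the following minimal Python sketch groups records by the keys produced by one or more blocking functions and only pairs up records that share a block. The sample records and the two blocking functions (`by_name_prefix`, `by_zip`) are hypothetical and chosen only for this example; they are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (id, name, zip code).
records = [
    (1, "john smith", "10001"),
    (2, "jon smith", "10001"),
    (3, "john smyth", "10002"),
    (4, "mary jones", "10002"),
    (5, "marie jones", "20001"),
]

# Two illustrative blocking functions (assumptions, not from the paper).
def by_name_prefix(rec):
    return "name:" + rec[1][:3]

def by_zip(rec):
    return "zip:" + rec[2]

def candidate_pairs(recs, blocking_functions):
    """Return the set of record-id pairs that share at least one block."""
    blocks = defaultdict(list)
    for rec in recs:
        for f in blocking_functions:
            blocks[f(rec)].append(rec[0])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are paired for verification.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

name_only = candidate_pairs(records, [by_name_prefix])
both = candidate_pairs(records, [by_name_prefix, by_zip])
all_pairs = len(records) * (len(records) - 1) // 2  # exhaustive comparison

print(len(name_only), len(both), all_pairs)
```

In this toy data the pair (1, 2) ("john smith" vs. "jon smith") is missed by the name-prefix function alone and is only found once the zip-code function is added, while the number of candidate pairs stays below the exhaustive count.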

The MapReduce (MR) framework is widely used to improve the performance and scalability of complicated queries by running many map and reduce tasks in parallel. However, entity matching on MapReduce is non-trivial because of two unavoidable challenges: load balancing and pair deduplication. Although existing work can deal with load balancing and pair deduplication separately, it cannot deal with both challenges at the same time. In this paper, we propose a novel solution, called MrEm, that handles both challenges while supporting multiple blocking functions. Theoretical analysis and experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed solution.
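To make the pair deduplication challenge concrete, here is a small in-memory simulation of a map/shuffle/reduce pipeline in Python. It uses a generic "smallest common block key" rule: each record carries all of its block keys, and a reducer verifies a pair only if its own key is the smallest key the two records share. This is a common way to avoid verifying the same pair in several blocks; it is shown only as a sketch and is not claimed to be the MrEm algorithm. Load balancing is not addressed here. All function names, records, and blocking functions are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(records, blocking_functions):
    """Emit (block_key, (record_id, all_block_keys)) once per block key."""
    for rec_id, value in records:
        keys = sorted({f(value) for f in blocking_functions})
        for key in keys:
            yield key, (rec_id, keys)

def shuffle(emitted):
    """Group mapper output by block key, as the MR framework would."""
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Yield each pair only in the block with the smallest shared key."""
    for (id_a, keys_a), (id_b, keys_b) in combinations(sorted(values), 2):
        shared = set(keys_a) & set(keys_b)
        if key == min(shared):
            yield (id_a, id_b)

def run(records, blocking_functions):
    groups = shuffle(map_phase(records, blocking_functions))
    verified = []
    for key, values in groups.items():
        verified.extend(reduce_phase(key, values))
    return verified

# Hypothetical records: (id, (name, zip code)).
records = [
    (1, ("john smith", "10001")),
    (2, ("johnny smith", "10001")),
    (3, ("jon smith", "10001")),
    (4, ("mary jones", "20001")),
]
blocking_functions = [
    lambda v: "name:" + v[0][:3],  # illustrative blocking function
    lambda v: "zip:" + v[1],       # illustrative blocking function
]

verified = run(records, blocking_functions)
print(sorted(verified))
```

Records 1 and 2 share both the name block and the zip block, so a naive reducer would verify that pair twice; with the rule above it is verified only in the name block.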

Keywords

entity matching / MapReduce / load balancing / pair deduplication

Cite this article

Cheqing JIN, Jie CHEN, Huiping LIU. MapReduce-based entity matching with multiple blocking functions. Front. Comput. Sci., 2017, 11(5): 895‒911 https://doi.org/10.1007/s11704-016-5346-4

RIGHTS & PERMISSIONS

2016 Higher Education Press and Springer-Verlag Berlin Heidelberg

Received: 14 Aug 2015
Accepted: 15 Dec 2015
Just Accepted Date: 31 Dec 2015
Online First Date: 14 Sep 2016
Issue Date: 26 Sep 2017
Published: 26 Sep 2017
