MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Yaobin HE, Haoyu TAN, Wuman LUO, Shengzhong FENG, Jianping FAN

PDF(1578 KB)
PDF(1578 KB)
Front. Comput. Sci. ›› 2014, Vol. 8 ›› Issue (1) : 83-99. DOI: 10.1007/s11704-013-3158-3
RESEARCH ARTICLE

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Author information +
History +

Abstract

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.

Keywords

data clustering / parallel algorithm / data mining / load balancing

Cite this article

Download citation ▾
Yaobin HE, Haoyu TAN, Wuman LUO, Shengzhong FENG, Jianping FAN. MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Front. Comput. Sci., 2014, 8(1): 83‒99 https://doi.org/10.1007/s11704-013-3158-3

References

[1]
Ester M, Kriegel H P, Sander J, Xu X. A densitybased algorithm for discovering clusters in large spatial databases. Data Mining and Knowledge Discovery, 1996, 96: 226−231
[2]
MacQueen J B. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967, 281−297
[3]
Zhang T, Ramakrishnan R, Livny M. Birch: an efficient data clustering method for very large databases. In: Proceedings of 1996 the ACM SIGMOD Conference on Managemnet of Data. 1996, 103−114
CrossRef Google scholar
[4]
Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statisticai Societ, 1977, 39(1): 1−38
[5]
Wang W, Yang J, Muntz R R. Sting: A statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, 186−195
[6]
Microsoft Academic Search. Top publications in data mining. 2013
[7]
Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. 2008, 107−113
[8]
White T. Hadoop: The Definitive Guide, 1st edition. O’Reilly Media, Inc., 2009
[9]
Berger M, Bokhari S. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, 1987, 36: 570−580
CrossRef Google scholar
[10]
Dai B R, Lin I C. Efficient map/reduce-based dbscan algorithm with optimized data partition. In: Proceedings of the 5th IEEE International Conference on Cloud Computing. 2012, 59−66
[11]
Leutenegger S T, Edgington J M, Lopez M A. Str: a simple and efficient algorithm for r-tree packing. In: Proceedings of the 1997 IEEE International Conference on Data Engineering. 1997, 497−506
[12]
Theodoridis Y, Sellis T. A model for the prediction of r-tree perfor-mance. In: Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 1996, 161−171
[13]
United States Census Bureau. TIGER/Line Shapefiles. http://www. census.gov/geo/maps-data/data/tiger-line.html
[14]
Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169−194
CrossRef Google scholar
[15]
Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: ordering points to identify the clustering structure. SIGMOD Record, 1999, 28: 49−60
CrossRef Google scholar
[16]
Januzaj E, Kriegel H P, Pfeifle M. Scalable density-based distributed clustering. In: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases. 2004, 231−244
[17]
Zhao W, Ma H, He Q. Parallel k-means clustering based on mapreduce. In: Proceedings of the 1st International Conference on Cloud Computing. 2009, 674−679
[18]
Kwon Y, Nunley D, Gardner J P, Balazinska M, Howe B, Loebman S. Scalable clustering algorithm for n-body simulations in a sharednothing cluster. In: Proceedings of the 22nd International Conference on Scientific and Statistical Database Management. 2010, 132−150
[19]
Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18: 509−517
CrossRef Google scholar
[20]
Xu X, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 1999, 3: 263−290
CrossRef Google scholar
[21]
He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J. MR-DBSCAN: an efficient parallel density-based clustering algorithm using mapreduce. In: Proceedings of the 2011 IEEE International Conference on Parallel and Distributed Systems. 2011, 473−480
CrossRef Google scholar

RIGHTS & PERMISSIONS

2013 Higher Education Press and Springer-Verlag Berlin Heidelberg
AI Summary AI Mindmap
PDF(1578 KB)

Accesses

Citations

Detail

Sections
Recommended

/