A survey of density based clustering algorithms

Panthadeep BHATTACHARJEE; Pinaki MITRA

doi:10.1007/s11704-019-9059-3

Front. Comput. Sci. ›› 2021, Vol. 15 ›› Issue (1) :151308 DOI: 10.1007/s11704-019-9059-3

REVIEW ARTICLE

A survey of density based clustering algorithms

Panthadeep BHATTACHARJEE ^†
, Pinaki MITRA

Author information +

History +

PDF (915KB)

Abstract

Density based clustering algorithms (DBCLAs) rely on the notion of density to identify clusters of arbitrary shapes, sizes with varying densities. Existing surveys on DBCLAs cover only a selected set of algorithms. These surveys fail to provide an extensive information about a variety of DBCLAs proposed till date including a taxonomy of the algorithms. In this paper we present a comprehensive survey of various DBCLAs over last two decades along with their classification. We group the DBCLAs in each of the four categories: density definition, parameter sensitivity, execution mode and nature of data and further divide them into various classes under each of these categories. In addition, we compare the DBCLAs through their common features and variations in citation and conceptual dependencies. We identify various application areas of DBCLAs in domains such as astronomy, earth sciences, molecular biology, geography, multimedia. Our survey also identifies probable future directions of DBCLAs where involvement of density based methods may lead to favorable results.

Keywords

clustering / density based clustering / survey / classification / common properties / applications

Cite this article

Download citation ▾

Panthadeep BHATTACHARJEE, Pinaki MITRA. A survey of density based clustering algorithms. Front. Comput. Sci., 2021, 15 (1) : 151308 DOI:10.1007/s11704-019-9059-3

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Jain A K, Murty M N, Flynn P J. Data clustering: a review. ACM Computing Surveys, 1999, 31(3): 264–323

[2]	Yan Z, Luo W, Bu C, Ni L. Clustering spatial data by the neighbors intersection and the density difference. In: Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing Applications and Technologies (BDCAT). 2016, 217–226

[3]	Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3): 645–678

[4]	Ertöz L, Steinbach M, Kumar V. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM International Conference on Data Mining. 2003, 47–58

[5]	Karypis G, Han E H, Kumar V. Chameleon: hierarchical clustering using dynamic modeling. Computer, 1999, 32(8): 68–75

[6]	Ester M, Kriegel H P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceeedings of International Conference on KDD. 1996, 226–231

[7]	Keogh E, Mueen A. Encyclopedia ofMachine Learning and DataMining. Curse of Dimensionality. 2nd ed. Springer, Boston, MA, 2017, 314–315

[8]	Raymond T N, Han J. Clarans: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 2002, 14(5): 1003–1016

[9]	Hartigan J A, Wong M A. Algorithm as 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979, 28(1): 100–108

[10]	Silverman B W. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC press, 1998

[11]	Aggarwal C, Reddy C K. Data Clustering: Algorithms and Applications. CRC press, 2013

[12]	Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference onManagement of data. 1998, 94–105

[13]	Parimala M, Lopez D, Senthilkumar N C. A survey on density based clustering algorithms for mining large spatial databases. International Journal of Advanced Science and Technology, 2011, 31(1): 59–66

[14]	Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record, 1999, 28: 49–60

[15]	Ram A, Jalal S, Jalal A S, Kumar M. A density based algorithm for discovering density varied clusters in large spatial databases. International Journal of Computer Applications, 2010, 3(6): 1–4

[16]	Birant D, Kut A. ST-DBSCAN: an algorithm for clustering spatialtemporal data. Data & Knowledge Engineering, 2007, 60(1): 208–221

[17]	Bhuyan R, Borah S. A survey of some density based clustering techniques. In: Proceedings of the 2013 Conference on Advancements in Information, Computer and Communication. 2013

[18]	Chaudhary S, Chauhan R, Batra P. A survey of density based clustering algorithms. International Journal of Computer Science and Technology, 2014, 5(2): 169–171

[19]	Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. 1998, 58–65

[20]	Nagpal P B, Mann P A. Comparative study of density based clustering algorithms. International Journal of Computer Applications, 2011, 27(11): 421–435

[21]	Xu X, Ester M, Kriegel H P, Sander J. A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of the 14th IEEE International Conference on Data Engineering. 1998, 324–331

[22]	Thiagarasu V, Prabahari R. A comparative analysis of density based clustering techniques for outlier mining. International Journal of Engineering Sciences and Research Technology, 2014, 3(11): 132–136

[23]	Rajavat A, Dubey P. Comparative study between density based clustering- DBSCAN and OPTICS. International Journal of Advance Computational Engineering and Networking, 2016, 4(12): 34–37

[24]	Smiti A, Eloudi Z. Soft DBSCAN: improving DBSCAN clustering method using fuzzy set theory. In: Proceedings of the 6th International Conference on Human System Interactions. 2013, 380–385

[25]	Loh W K, Park Y H. A survey on density-based clustering algorithms. In: Jeong Y S, Park Y H, Hsu C H, Park J, eds. Ubiquitous Information Technologies and Applications. Springer, Berlin, Heidelberg, 2014, 775–780

[26]	Xu X, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases. In: Guo Y, Grossman R, eds. High Performance Data Mining. Springer, Boston, MA, 1999, 263–290

[27]	Böhm C, Noll R, Plant C, Wackersreuther B. Density-based clustering using graphics processors. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009, 661–670

[28]	Loh W K, Moon Y S, Park Y H. Erratum: fast density-based clustering using graphics processing units. IEICE TRANSACTIONS on Information and Systems, 2014, 97(7): 1947–1951

[29]	Ting K M, Zhu Y, Carman M J, Zhu Y, Zhou Z H. Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 1205–1214.

[30]	Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169–194

[31]	Ester M, Kriegel H P, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 323–333

[32]	Zhou S, Zhou A, Cao J, Wen J, Fan Y, Hu Y. Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2000, 169–172

[33]	Jarvis R A, Patrick E A. Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 1973, 100(11): 1025–1034

[34]	Januzaj E, Kriegel H P, Pfeifle M. DBDC: density based distributed clustering. Advances in Database Technology-EDBT 2004, 2004, 529–530

[35]	Borah B, Bhattacharyya D K. An improved sampling-based DBSCAN for large spatial databases. In: Proceedings of IEEE International Conference on Intelligent Sensing and Information Processing. 2004, 92–96

[36]	Tsai C F, Liu C W. KIDBSCAN: a newefficient data clustering algorithm. In: Proceedings of International Conference on Artifical Intelligence and Soft Computing. 2006, 702–711

[37]	Kisilevich S, Mansmann F, Keim D. P-DBSCAN: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. In: Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research and Application. 2010

[38]	An N, Khac L, Kechadi M T. On a distributed approach for density-based clustering. In: Proceedings of the 10th International Conference on Machine Learning and Applications and Workshops. 2011, 283–286

[39]	He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems. 2011, 473–480

[40]	Campello R J, Moulavi D, Sander J. Density-based clustering based on hierarchical density estimates. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2013, 160–172

[41]	Yu Y, Zhao J, Wang X, Wang Q, Zhang Y. Cludoop: an efficient distributed density-based clustering for big data using hadoop. International Journal of Distributed Sensor Networks, 2015, 11(6): 379–391

[42]	Hou J, Gao H, Li X. DSets-DBSCAN: a parameter-free clustering algorithm. IEEE Transactions on Image Processing, 2016, 25(7): 3182–3193

[43]	Pavan M, Pelillo M. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 29(1): 167–172

[44]	Gan J, Tao Y. Dynamic density based clustering. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 1493–1507

[45]	Wang W, Yang J, Muntz R. Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases. 1997, 186–195

[46]	Davis L. Bit-climbing, representational bias, and test suite design. In: Proceedings of the 4th International Conference on Genetic Algorithms. 1991, 18–23

[47]	Hinneburg A, Keim D A. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th International Conference on Very Large Data Bases. 1999

[48]	Zhang T, Ramakrishnan R, Livny M. BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1997, 1(2): 141–182

[49]	Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: a multiresolution clustering approach for very large spatial databases. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 428–439

[50]	Aggarwal C C. A human-computer interactive method for projected clustering. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(4): 448–460

[51]	Chen Y, Tu L. Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007, 133–142

[52]	Aggarwal C C. Outlier Analysis. In: Data Mining. Springer, 2015, 237–263

[53]	Kriegel H P, Pfeifle M. Hierarchical density-based clustering of uncertain data. In: Proceedings of the 5th IEEE International Conference on Data Mining. 2005

[54]	Kriegel H P, Pfeifle M. Density-based clustering of uncertain data. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 2005, 672–677

[55]	Stonebraker M, Frew J, Dozier J. The SEQUOIA 2000 Project. In: Proceedings of International Symposium on Spatial Databases. 1993, 397–412

[56]	Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM Sigmod Record, 1996, 25(2): 103–114

[57]	Zheng Y, Xie X, Ma W Y. Geolife: a collaborative social networking service among user, location and trajectory. IEEE Data Engineering Bulletin, 2010, 33(2): 32–39

[58]	Yuan J, Zheng Y, Xie X, Sun G. T-drive: enhancing driving directions with taxi drivers’ intelligence. IEEE Transactions on Knowledge and Data Engineering, 2011, 25(1): 220–232, 2011

[59]	Yeung K Y, Fraley C, Murua A, Raftery A E, Ruzzo W L. Model-based clustering and data transformations for gene expression data. Bioinformatics, 2001, 17(10): 977–987

[60]	Bishop C M. Pattern Recognition and Machine Learning. Springer, 2006

[61]	Lu C D, Zhang T Y, Du X Z, Li C P. A robust kernel PCA algorithm. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics. 2004, 3084–3087

[62]	Balakrishnama S, Ganapathiraju A. Linear discriminant analysis-a brief tutorial. Institute for Signal and Information Processing, 1998, 18: 1–8

[63]	Baudat G, Anouar F. Generalized discriminant analysis using a kernel approach. Neural Computation, 2000, 12(10): 2385–2404

[64]	Epanechnikov V A. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 1969, 14(1): 153–158

[65]	Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. ACM Sigmod Record, 1998, 27(2): 73–84

[66]	Tavallaee M, Bagheri E, Lu W, Ghorbani A A. A detailed analysis of the KDD cup 99 data set. In: Proceedings of IEEE Symposium on Computational Intelligence for Security and Defense Applications. 2009, 1–6

[67]	Bader G D, Hogue C W V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 2003, 4(1): 2

[68]	Connolly M L. Measurement of protein surface shape by solid angles. Journal of Molecular Graphics, 1986, 4(1): 3–6

[69]	Becker R H, White R L, Helfand D J. The first survey: faint images of the radio sky at twenty centimeters. The Astrophysical Journal, 1995, 450: 559

[70]	Zepka A F, Cordes J M, Wasserman I. Signal detection amid noise with known statistics. The Astrophysical Journal, 1994, 427: 438–445

[71]	Weir N, Fayyad U M, Djorgovski S. Automated star/galaxy classification for digitized POSS-II. The Astronomical Journal, 1995, 109: 2401

[72]	Chrisman N R, Dougenik J A, White D. Lessons for the design of polygon overlay processing from the odyssey whirlpool algorithm. In: Proceedings of the 5th International Symposium on Spatial Data Handling. 1992, 401–410

[73]	Du Q, Dong Z, Huang C, Ren F. Density-based clustering with geographical background constraints using a semantic expression model. ISPRS International Journal of Geo-Information, 2016, 5(5): 72

[74]	Hendler J. Data integration for heterogenous datasets. Big Data, 2014, 2(4): 205–215

[75]	Allen T E, Herrgård M J, Liu M, Qiu Y, Glasner J, Blattner F R, Palsson B. Genome-scale analysis of the uses of the escherichia coli genome: modeldriven analysis of heterogeneous data sets. Journal of Bacteriology, 2003, 185(21): 6392–6399

[76]	Pavlidis P, Weston J, Cai J, Grundy W N. Gene functional classification from heterogeneous data. In: Proceedings of the 5th Annual International Conference on Computational Biology, 2001, 249–255

[77]	Angelier J. Tectonic analysis of fault slip data sets. Journal of Geophysical Research: Solid Earth, 1984, 89(B7): 5835–5848

[78]	Kowalski M, Rubin D, Aldering G, Agostinho R J, Amadon A, Amanullah R, Balland C, Barbary K, Blanc G, Challis P J, et al. Improved cosmological constraints from new, old, and combined supernova data sets. The Astrophysical Journal, 2008, 686(2): 749

[79]	Nguyen M D, Shin W Y. DBSTexC: density-based spatio–textual clustering on twitter. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2017, 23–26