A survey of density based clustering algorithms
Panthadeep BHATTACHARJEE, Pinaki MITRA
A survey of density based clustering algorithms
Density based clustering algorithms (DBCLAs) rely on the notion of density to identify clusters of arbitrary shapes, sizes with varying densities. Existing surveys on DBCLAs cover only a selected set of algorithms. These surveys fail to provide an extensive information about a variety of DBCLAs proposed till date including a taxonomy of the algorithms. In this paper we present a comprehensive survey of various DBCLAs over last two decades along with their classification. We group the DBCLAs in each of the four categories: density definition, parameter sensitivity, execution mode and nature of data and further divide them into various classes under each of these categories. In addition, we compare the DBCLAs through their common features and variations in citation and conceptual dependencies. We identify various application areas of DBCLAs in domains such as astronomy, earth sciences, molecular biology, geography, multimedia. Our survey also identifies probable future directions of DBCLAs where involvement of density based methods may lead to favorable results.
clustering / density based clustering / survey / classification / common properties / applications
[1] |
Jain A K, Murty M N, Flynn P J. Data clustering: a review. ACM Computing Surveys, 1999, 31(3): 264–323
CrossRef
Google scholar
|
[2] |
Yan Z, Luo W, Bu C, Ni L. Clustering spatial data by the neighbors intersection and the density difference. In: Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing Applications and Technologies (BDCAT). 2016, 217–226
CrossRef
Google scholar
|
[3] |
Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3): 645–678
CrossRef
Google scholar
|
[4] |
Ertöz L, Steinbach M, Kumar V. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM International Conference on Data Mining. 2003, 47–58
CrossRef
Google scholar
|
[5] |
Karypis G, Han E H, Kumar V. Chameleon: hierarchical clustering using dynamic modeling. Computer, 1999, 32(8): 68–75
CrossRef
Google scholar
|
[6] |
Ester M, Kriegel H P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceeedings of International Conference on KDD. 1996, 226–231
|
[7] |
Keogh E, Mueen A. Encyclopedia ofMachine Learning and DataMining. Curse of Dimensionality. 2nd ed. Springer, Boston, MA, 2017, 314–315
CrossRef
Google scholar
|
[8] |
Raymond T N, Han J. Clarans: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 2002, 14(5): 1003–1016
CrossRef
Google scholar
|
[9] |
Hartigan J A, Wong M A. Algorithm as 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979, 28(1): 100–108
CrossRef
Google scholar
|
[10] |
Silverman B W. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC press, 1998
|
[11] |
Aggarwal C, Reddy C K. Data Clustering: Algorithms and Applications. CRC press, 2013
CrossRef
Google scholar
|
[12] |
Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference onManagement of data. 1998, 94–105
CrossRef
Google scholar
|
[13] |
Parimala M, Lopez D, Senthilkumar N C. A survey on density based clustering algorithms for mining large spatial databases. International Journal of Advanced Science and Technology, 2011, 31(1): 59–66
|
[14] |
Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record, 1999, 28: 49–60
CrossRef
Google scholar
|
[15] |
Ram A, Jalal S, Jalal A S, Kumar M. A density based algorithm for discovering density varied clusters in large spatial databases. International Journal of Computer Applications, 2010, 3(6): 1–4
CrossRef
Google scholar
|
[16] |
Birant D, Kut A. ST-DBSCAN: an algorithm for clustering spatialtemporal data. Data & Knowledge Engineering, 2007, 60(1): 208–221
CrossRef
Google scholar
|
[17] |
Bhuyan R, Borah S. A survey of some density based clustering techniques. In: Proceedings of the 2013 Conference on Advancements in Information, Computer and Communication. 2013
|
[18] |
Chaudhary S, Chauhan R, Batra P. A survey of density based clustering algorithms. International Journal of Computer Science and Technology, 2014, 5(2): 169–171
|
[19] |
Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. 1998, 58–65
|
[20] |
Nagpal P B, Mann P A. Comparative study of density based clustering algorithms. International Journal of Computer Applications, 2011, 27(11): 421–435
CrossRef
Google scholar
|
[21] |
Xu X, Ester M, Kriegel H P, Sander J. A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of the 14th IEEE International Conference on Data Engineering. 1998, 324–331
|
[22] |
Thiagarasu V, Prabahari R. A comparative analysis of density based clustering techniques for outlier mining. International Journal of Engineering Sciences and Research Technology, 2014, 3(11): 132–136
|
[23] |
Rajavat A, Dubey P. Comparative study between density based clustering- DBSCAN and OPTICS. International Journal of Advance Computational Engineering and Networking, 2016, 4(12): 34–37
|
[24] |
Smiti A, Eloudi Z. Soft DBSCAN: improving DBSCAN clustering method using fuzzy set theory. In: Proceedings of the 6th International Conference on Human System Interactions. 2013, 380–385
CrossRef
Google scholar
|
[25] |
Loh W K, Park Y H. A survey on density-based clustering algorithms. In: Jeong Y S, Park Y H, Hsu C H, Park J, eds. Ubiquitous Information Technologies and Applications. Springer, Berlin, Heidelberg, 2014, 775–780
CrossRef
Google scholar
|
[26] |
Xu X, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases. In: Guo Y, Grossman R, eds. High Performance Data Mining. Springer, Boston, MA, 1999, 263–290
CrossRef
Google scholar
|
[27] |
Böhm C, Noll R, Plant C, Wackersreuther B. Density-based clustering using graphics processors. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009, 661–670
CrossRef
Google scholar
|
[28] |
Loh W K, Moon Y S, Park Y H. Erratum: fast density-based clustering using graphics processing units. IEICE TRANSACTIONS on Information and Systems, 2014, 97(7): 1947–1951
CrossRef
Google scholar
|
[29] |
Ting K M, Zhu Y, Carman M J, Zhu Y, Zhou Z H. Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 1205–1214.
CrossRef
Google scholar
|
[30] |
Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169–194
CrossRef
Google scholar
|
[31] |
Ester M, Kriegel H P, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 323–333
|
[32] |
Zhou S, Zhou A, Cao J, Wen J, Fan Y, Hu Y. Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2000, 169–172
CrossRef
Google scholar
|
[33] |
Jarvis R A, Patrick E A. Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 1973, 100(11): 1025–1034
CrossRef
Google scholar
|
[34] |
Januzaj E, Kriegel H P, Pfeifle M. DBDC: density based distributed clustering. Advances in Database Technology-EDBT 2004, 2004, 529–530
CrossRef
Google scholar
|
[35] |
Borah B, Bhattacharyya D K. An improved sampling-based DBSCAN for large spatial databases. In: Proceedings of IEEE International Conference on Intelligent Sensing and Information Processing. 2004, 92–96
|
[36] |
Tsai C F, Liu C W. KIDBSCAN: a newefficient data clustering algorithm. In: Proceedings of International Conference on Artifical Intelligence and Soft Computing. 2006, 702–711
CrossRef
Google scholar
|
[37] |
Kisilevich S, Mansmann F, Keim D. P-DBSCAN: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. In: Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research and Application. 2010
CrossRef
Google scholar
|
[38] |
An N, Khac L, Kechadi M T. On a distributed approach for density-based clustering. In: Proceedings of the 10th International Conference on Machine Learning and Applications and Workshops. 2011, 283–286
|
[39] |
He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems. 2011, 473–480
CrossRef
Google scholar
|
[40] |
Campello R J, Moulavi D, Sander J. Density-based clustering based on hierarchical density estimates. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2013, 160–172
CrossRef
Google scholar
|
[41] |
Yu Y, Zhao J, Wang X, Wang Q, Zhang Y. Cludoop: an efficient distributed density-based clustering for big data using hadoop. International Journal of Distributed Sensor Networks, 2015, 11(6): 379–391
CrossRef
Google scholar
|
[42] |
Hou J, Gao H, Li X. DSets-DBSCAN: a parameter-free clustering algorithm. IEEE Transactions on Image Processing, 2016, 25(7): 3182–3193
CrossRef
Google scholar
|
[43] |
Pavan M, Pelillo M. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 29(1): 167–172
CrossRef
Google scholar
|
[44] |
Gan J, Tao Y. Dynamic density based clustering. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 1493–1507
CrossRef
Google scholar
|
[45] |
Wang W, Yang J, Muntz R. Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases. 1997, 186–195
|
[46] |
Davis L. Bit-climbing, representational bias, and test suite design. In: Proceedings of the 4th International Conference on Genetic Algorithms. 1991, 18–23
|
[47] |
Hinneburg A, Keim D A. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th International Conference on Very Large Data Bases. 1999
|
[48] |
Zhang T, Ramakrishnan R, Livny M. BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1997, 1(2): 141–182
CrossRef
Google scholar
|
[49] |
Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: a multiresolution clustering approach for very large spatial databases. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 428–439
|
[50] |
Aggarwal C C. A human-computer interactive method for projected clustering. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(4): 448–460
CrossRef
Google scholar
|
[51] |
Chen Y, Tu L. Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007, 133–142
CrossRef
Google scholar
|
[52] |
Aggarwal C C. Outlier Analysis. In: Data Mining. Springer, 2015, 237–263
CrossRef
Google scholar
|
[53] |
Kriegel H P, Pfeifle M. Hierarchical density-based clustering of uncertain data. In: Proceedings of the 5th IEEE International Conference on Data Mining. 2005
CrossRef
Google scholar
|
[54] |
Kriegel H P, Pfeifle M. Density-based clustering of uncertain data. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 2005, 672–677
CrossRef
Google scholar
|
[55] |
Stonebraker M, Frew J, Dozier J. The SEQUOIA 2000 Project. In: Proceedings of International Symposium on Spatial Databases. 1993, 397–412
CrossRef
Google scholar
|
[56] |
Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM Sigmod Record, 1996, 25(2): 103–114
CrossRef
Google scholar
|
[57] |
Zheng Y, Xie X, Ma W Y. Geolife: a collaborative social networking service among user, location and trajectory. IEEE Data Engineering Bulletin, 2010, 33(2): 32–39
|
[58] |
Yuan J, Zheng Y, Xie X, Sun G. T-drive: enhancing driving directions with taxi drivers’ intelligence. IEEE Transactions on Knowledge and Data Engineering, 2011, 25(1): 220–232, 2011
CrossRef
Google scholar
|
[59] |
Yeung K Y, Fraley C, Murua A, Raftery A E, Ruzzo W L. Model-based clustering and data transformations for gene expression data. Bioinformatics, 2001, 17(10): 977–987
CrossRef
Google scholar
|
[60] |
Bishop C M. Pattern Recognition and Machine Learning. Springer, 2006
|
[61] |
Lu C D, Zhang T Y, Du X Z, Li C P. A robust kernel PCA algorithm. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics. 2004, 3084–3087
|
[62] |
Balakrishnama S, Ganapathiraju A. Linear discriminant analysis-a brief tutorial. Institute for Signal and Information Processing, 1998, 18: 1–8
|
[63] |
Baudat G, Anouar F. Generalized discriminant analysis using a kernel approach. Neural Computation, 2000, 12(10): 2385–2404
CrossRef
Google scholar
|
[64] |
Epanechnikov V A. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 1969, 14(1): 153–158
CrossRef
Google scholar
|
[65] |
Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. ACM Sigmod Record, 1998, 27(2): 73–84
CrossRef
Google scholar
|
[66] |
Tavallaee M, Bagheri E, Lu W, Ghorbani A A. A detailed analysis of the KDD cup 99 data set. In: Proceedings of IEEE Symposium on Computational Intelligence for Security and Defense Applications. 2009, 1–6
CrossRef
Google scholar
|
[67] |
Bader G D, Hogue C W V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 2003, 4(1): 2
CrossRef
Google scholar
|
[68] |
Connolly M L. Measurement of protein surface shape by solid angles. Journal of Molecular Graphics, 1986, 4(1): 3–6
CrossRef
Google scholar
|
[69] |
Becker R H, White R L, Helfand D J. The first survey: faint images of the radio sky at twenty centimeters. The Astrophysical Journal, 1995, 450: 559
CrossRef
Google scholar
|
[70] |
Zepka A F, Cordes J M, Wasserman I. Signal detection amid noise with known statistics. The Astrophysical Journal, 1994, 427: 438–445
CrossRef
Google scholar
|
[71] |
Weir N, Fayyad U M, Djorgovski S. Automated star/galaxy classification for digitized POSS-II. The Astronomical Journal, 1995, 109: 2401
CrossRef
Google scholar
|
[72] |
Chrisman N R, Dougenik J A, White D. Lessons for the design of polygon overlay processing from the odyssey whirlpool algorithm. In: Proceedings of the 5th International Symposium on Spatial Data Handling. 1992, 401–410
|
[73] |
Du Q, Dong Z, Huang C, Ren F. Density-based clustering with geographical background constraints using a semantic expression model. ISPRS International Journal of Geo-Information, 2016, 5(5): 72
CrossRef
Google scholar
|
[74] |
Hendler J. Data integration for heterogenous datasets. Big Data, 2014, 2(4): 205–215
CrossRef
Google scholar
|
[75] |
Allen T E, Herrgård M J, Liu M, Qiu Y, Glasner J, Blattner F R, Palsson B. Genome-scale analysis of the uses of the escherichia coli genome: modeldriven analysis of heterogeneous data sets. Journal of Bacteriology, 2003, 185(21): 6392–6399
CrossRef
Google scholar
|
[76] |
Pavlidis P, Weston J, Cai J, Grundy W N. Gene functional classification from heterogeneous data. In: Proceedings of the 5th Annual International Conference on Computational Biology, 2001, 249–255
CrossRef
Google scholar
|
[77] |
Angelier J. Tectonic analysis of fault slip data sets. Journal of Geophysical Research: Solid Earth, 1984, 89(B7): 5835–5848
CrossRef
Google scholar
|
[78] |
Kowalski M, Rubin D, Aldering G, Agostinho R J, Amadon A, Amanullah R, Balland C, Barbary K, Blanc G, Challis P J, et al. Improved cosmological constraints from new, old, and combined supernova data sets. The Astrophysical Journal, 2008, 686(2): 749
CrossRef
Google scholar
|
[79] |
Nguyen M D, Shin W Y. DBSTexC: density-based spatio–textual clustering on twitter. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2017, 23–26
CrossRef
Google scholar
|
/
〈 | 〉 |