An extended EM algorithm for subspace clustering

doi:10.1007/s11704-008-0007-x

Front. Comput. Sci., 2008, Vol. 2, Issue 1: 81-86. DOI: 10.1007/s11704-008-0007-x

CHEN Lifei¹, JIANG Qingshan²

Abstract

Clustering high-dimensional data is a challenge in data mining because of the curse of dimensionality. Subspace clustering addresses this problem by extending traditional clustering to find clusters in subspaces spanned by different combinations of dimensions within a dataset. This paper presents a new subspace clustering algorithm that calculates local feature weights automatically within an EM-based clustering process. The features are weighted locally by a new unsupervised weighting method, as a means of minimizing a proposed clustering criterion that accounts for both the average intra-cluster compactness and the average inter-cluster separation in subspace clustering. To capture accurate subspace information, an additional outlier detection process identifies possible local outliers of the subspace clusters and is embedded between the E-step and the M-step of the algorithm. The method has been evaluated on real-world gene expression data and on high-dimensional artificial data with outliers, and the experimental results show its effectiveness.
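The abstract gives only the outline of the algorithm, so the following is a minimal sketch of that outline and not the authors' method. It rests on several assumptions that do not come from the paper: hard cluster assignments stand in for the E-step, inverse-variance weights stand in for the unsupervised weighting method, a simple mean-plus-z-standard-deviations rule stands in for the outlier detection process, and the inter-cluster separation term of the criterion is ignored. The names `weighted_em_sketch`, `weighted_sq_dist`, `init_centers`, and `outlier_z` are hypothetical.

```python
def weighted_sq_dist(x, center, weights):
    """Squared distance with one weight per feature (one weight vector per cluster)."""
    return sum(w * (a - c) ** 2 for a, c, w in zip(x, center, weights))


def weighted_em_sketch(points, init_centers, n_iter=10, outlier_z=3.0, eps=1e-9):
    """Illustrative loop: E-step, outlier step, M-step with local feature weights."""
    k, dim, n = len(init_centers), len(points[0]), len(points)
    centers = [list(c) for c in init_centers]
    weights = [[1.0 / dim] * dim for _ in range(k)]  # start with uniform weights
    labels, outlier = [0] * n, [False] * n

    for _ in range(n_iter):
        # E-step (assumption: hard assignment to the nearest center
        # under that cluster's own feature weights).
        dists = [0.0] * n
        for i, x in enumerate(points):
            d = [weighted_sq_dist(x, centers[c], weights[c]) for c in range(k)]
            labels[i] = min(range(k), key=d.__getitem__)
            dists[i] = d[labels[i]]

        # Outlier step, placed between the E-step and the M-step
        # (assumption: flag points far from their center relative to the cluster).
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                continue
            mean = sum(dists[i] for i in members) / len(members)
            std = (sum((dists[i] - mean) ** 2 for i in members) / len(members)) ** 0.5
            for i in members:
                outlier[i] = dists[i] > mean + outlier_z * std

        # M-step: update centers and local weights from non-outliers only
        # (assumption: weights proportional to inverse per-feature variance).
        for c in range(k):
            inliers = [points[i] for i in range(n) if labels[i] == c and not outlier[i]]
            if not inliers:
                continue
            cols = list(zip(*inliers))
            centers[c] = [sum(col) / len(inliers) for col in cols]
            var = [sum((v - m) ** 2 for v in col) / len(inliers)
                   for col, m in zip(cols, centers[c])]
            inv = [1.0 / (v + eps) for v in var]
            total = sum(inv)
            weights[c] = [v / total for v in inv]

    return labels, outlier, centers, weights
```

Calling `weighted_em_sketch(points, init_centers)` with a list of numeric tuples returns the cluster labels, the per-point outlier flags, the cluster centers, and one weight vector per cluster. In this sketch, a feature along which a cluster is compact receives a weight close to 1, which is one simple way to express that the cluster lives in a subspace.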

Cite this article

CHEN Lifei, JIANG Qingshan. An extended EM algorithm for subspace clustering. Front. Comput. Sci., 2008, 2(1): 81‒86 https://doi.org/10.1007/s11704-008-0007-x
