An effective framework for characterizing rare categories

Jingrui HE , Hanghang TONG , Jaime CARBONELL

Front. Comput. Sci., 2012, Vol. 6, Issue 2: 154-165. DOI: 10.1007/s11704-012-2861-9
RESEARCH ARTICLE


Abstract

Rare categories are becoming increasingly abundant, yet their characterization has received little attention thus far. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging due to high skewness and non-separability from the majority classes; fraudulent transactions, for example, masquerade as legitimate ones. This paper proposes the RACH algorithm, which exploits the compactness property of rare categories. The algorithm is semi-supervised in nature, since it uses both labeled and unlabeled data. It is based on an optimization framework that encloses the rare examples within a minimum-radius hyperball. The framework is then converted into a convex optimization problem, which is in turn solved effectively in its dual form by the projected subgradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.
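The core geometric idea of the abstract — enclosing the rare examples in a minimum-radius hyperball, solved by a subgradient method — can be illustrated with a toy solver. This is only a hedged sketch of the minimum-enclosing-ball subproblem, not the authors' RACH formulation: RACH additionally incorporates unlabeled examples as constraints and solves a kernelized dual with a projection step, none of which appears below (the toy problem here is unconstrained, so no projection is needed).

```python
import numpy as np

def min_enclosing_ball(points, n_iters=2000, step0=0.5):
    """Approximate the minimum-radius ball enclosing `points` by running
    the subgradient method on f(c) = max_i ||x_i - c||.

    Toy sketch only: RACH solves a richer constrained problem (with
    unlabeled data) in its kernelized dual form."""
    c = points.mean(axis=0)                      # start at the centroid
    best_c = c.copy()
    best_r = np.linalg.norm(points - c, axis=1).max()
    for t in range(1, n_iters + 1):
        dists = np.linalg.norm(points - c, axis=1)
        i = int(np.argmax(dists))                # farthest point attains the max
        if dists[i] < 1e-12:
            break                                # all points coincide with c
        g = (c - points[i]) / dists[i]           # a subgradient of f at c
        c = c - (step0 / np.sqrt(t)) * g         # diminishing step size
        r = np.linalg.norm(points - c, axis=1).max()
        if r < best_r:                           # keep the best iterate seen
            best_c, best_r = c.copy(), r
    return best_c, best_r
```

Each subgradient step simply pushes the center toward the currently farthest example; with a diminishing step size and best-iterate tracking, the radius converges toward the minimum enclosing radius.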

Keywords

rare category / minority class / characterization / compactness / optimization / hyperball / subgradient

Cite this article

Jingrui HE, Hanghang TONG, Jaime CARBONELL. An effective framework for characterizing rare categories. Front. Comput. Sci., 2012, 6(2): 154-165. DOI: 10.1007/s11704-012-2861-9



RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag Berlin Heidelberg
