RESEARCH ARTICLE

Learning from imbalanced data sets with a Min-Max modular support vector machine

  • Bao-Liang LU , 1,2 ,
  • Xiao-Lin WANG 1 ,
  • Yang YANG 3 ,
  • Hai ZHAO 1,2
  • 1. Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • 2. MOE-Microsoft Key Laboratory for Intelligent Computing and Intelligent Systems, Shanghai Jiao Tong University, Shanghai 200240, China
  • 3. Department of Computer Science and Engineering, Shanghai Maritime University, Shanghai 201306, China

Received date: 22 Jul 2010

Accepted date: 18 Oct 2010

Published date: 05 Mar 2011

Copyright

2014 Higher Education Press and Springer-Verlag Berlin Heidelberg

Abstract

Imbalanced data sets have significantly unequal distributions between classes. This between-class imbalance causes conventional classification methods to favor the majority classes, resulting in very low or even no detection of the minority classes. The Min-Max modular support vector machine (M3-SVM) approaches this problem by decomposing the training input sets of the majority classes into subsets of similar size and pairing them to form balanced two-class classification subproblems. This approach has the merits of using general classifiers, incorporating prior knowledge into task decomposition, and supporting parallel learning. Experiments on two real-world pattern classification problems, international patent classification and protein subcellular localization, demonstrate the effectiveness of the proposed approach.
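The decomposition and combination scheme described in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: both classes are split into subsets no larger than a chosen size, every positive subset is paired with every negative subset to give balanced subproblems, and sub-classifier outputs are combined by a MIN over negative subsets followed by a MAX over positive subsets. The function names are ours; the sub-classifiers themselves (e.g., SVMs trained per pair) are omitted.

```python
from math import ceil

def split(idx, subset_size):
    # Partition a class's sample indices into near-equal subsets,
    # each no larger than subset_size (round-robin keeps sizes within 1).
    k = max(1, ceil(len(idx) / subset_size))
    return [idx[i::k] for i in range(k)]

def decompose(pos_idx, neg_idx, subset_size):
    # Pair every positive subset with every negative subset; each pair
    # defines one balanced two-class subproblem (part-versus-part).
    return [(p, n) for p in split(pos_idx, subset_size)
                   for n in split(neg_idx, subset_size)]

def min_max_combine(scores):
    # scores[i][j] is the output, for one test point, of the sub-classifier
    # trained on (positive subset i, negative subset j).
    # MIN across negative subsets, then MAX across positive subsets.
    return max(min(row) for row in scores)

# Example: 6 minority samples vs. 100 majority samples, subset size 3
pairs = decompose(list(range(6)), list(range(100)), 3)
# ceil(6/3) = 2 positive subsets, ceil(100/3) = 34 negative subsets
print(len(pairs))                                   # 68 balanced subproblems
print(min_max_combine([[0.9, 0.2], [0.1, 0.8]]))    # 0.2
```

Note the combination rule: a point is accepted as positive only if every pairing against the negative subsets agrees (the MIN), while it suffices that some positive subset claims it (the MAX).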

Cite this article

Bao-Liang LU, Xiao-Lin WANG, Yang YANG, Hai ZHAO. Learning from imbalanced data sets with a Min-Max modular support vector machine[J]. Frontiers of Electrical and Electronic Engineering, 2011, 6(1): 56-71. DOI: 10.1007/s11460-011-0127-1

References

1. He H B, Garcia E A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284

2. Japkowicz N. Learning from imbalanced data sets. In: Proceedings of Workshops at the 17th National Conference on Artificial Intelligence, 2000

3. Chawla N V, Japkowicz N, Kolcz A. Workshop on Learning from Imbalanced Data Sets II. In: International Conference on Machine Learning, 2003

4. Chawla N V, Japkowicz N, Kolcz A. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 1-6

5. Lu B L, Wang K A, Utiyama M, Isahara H. A part-versus-part method for massively parallel training of support vector machines. In: Proceedings of IEEE/INNS International Joint Conference on Neural Networks. 2004, 735-740

6. Lu B L, Ito M. Task decomposition based on class relations: a modular neural network architecture for pattern classification. Lecture Notes in Computer Science, 1997, 1240: 330-339

7. Lu B L, Ito M. Task decomposition and module combination based on class relations: a modular neural network for pattern classification. IEEE Transactions on Neural Networks, 1999, 10(5): 1244-1256

8. Ye Z F, Wen Y M, Lu B L. A survey of imbalanced pattern classification problems. CAAI Transactions on Intelligent Systems, 2009: 148-156 (in Chinese)

9. Estabrooks A, Jo T, Japkowicz N. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 2004, 20(1): 18-36

10. Laurikkala J. Improving identification of difficult small classes by balancing class distribution. In: Proceedings of the Conference on Artificial Intelligence in Medicine in Europe. 2001, 63-66

11. Weiss G M, Provost F. The effect of class distribution on classifier learning: an empirical study. Technical Report MLTR-43. 2001

12. Batista G E A P A, Prati R C, Monard M C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 20-29

13. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of International Conference on Machine Learning. 1997, 179-186

14. Zhang J, Mani I. KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of International Conference on Machine Learning, Workshop on Learning from Imbalanced Data Sets. 2003, 1-7

15. Liu X Y, Wu J, Zhou Z H. Exploratory under-sampling for class imbalance learning. In: Proceedings of International Conference on Data Mining. 2006, 965-969

16. Chawla N V, Bowyer K W, Hall L O, Kegelmeyer W P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002, 16(3): 321-357

17. Jo T, Japkowicz N. Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 40-49

18. Chawla N V, Lazarevic A, Hall L O, Bowyer K W. SMOTEBoost: improving prediction of the minority class in boosting. In: Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases. 2003, 107-119

19. Guo H, Viktor H L. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 30-39

20. Mease D, Wyner A J, Buja A. Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research, 2007, 8: 409-439

21. Maloof M A. Learning when data sets are imbalanced and when costs are unequal and unknown. In: Proceedings of International Conference on Machine Learning, Workshop on Learning from Imbalanced Data Sets II. 2003, 1-8

22. Weiss G M. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 7-19

23. Liu X Y, Zhou Z H. The influence of class imbalance on cost-sensitive learning: an empirical study. In: Proceedings of International Conference on Data Mining. 2006, 970-974

24. Liu X Y, Zhou Z H. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(1): 63-77

25. McCarthy K, Zabar B, Weiss G M. Does cost-sensitive learning beat sampling for classifying rare classes? In: Proceedings of International Workshop on Utility-Based Data Mining. 2005, 69-77

26. Fan W, Stolfo S J, Zhang J, Chan P K. AdaCost: misclassification cost-sensitive boosting. In: Proceedings of International Conference on Machine Learning. 1999, 97-105

27. Sun Y, Kamel M S, Wong A K C, Wang Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 2007, 40(12): 3358-3378

28. Ting K M. A comparative study of cost-sensitive boosting algorithms. In: Proceedings of International Conference on Machine Learning. 2000, 983-990

29. Haykin S. Neural Networks: A Comprehensive Foundation. 2nd ed. New Jersey: Prentice-Hall, 1999

30. Kukar M Z, Kononenko I. Cost-sensitive learning with neural networks. In: Proceedings of the 13th European Conference on Artificial Intelligence. 1998, 445-449

31. Domingos P, Pazzani M. Beyond independence: conditions for the optimality of the simple Bayesian classifier. In: Proceedings of the International Conference on Machine Learning. 1996, 105-112

32. Gama J. Iterative Bayes. Theoretical Computer Science, 2003, 292(2): 417-430

33. Kohavi R, Wolpert D. Bias plus variance decomposition for zero-one loss functions. In: Proceedings of International Conference on Machine Learning. 1996, 275-283

34. Webb G R I, Pazzani M J. Adjusted probability naive Bayesian induction. In: Proceedings of the 11th Australian Joint Conference on Artificial Intelligence. 1998, 285-295

35. Drummond C, Holte R C. Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: Proceedings of the International Conference on Machine Learning. 2000, 239-246

36. Vapnik V N. The Nature of Statistical Learning Theory. Berlin: Springer, 1995

37. Joachims T. Making large-scale support vector machine learning practical. In: Advances in Kernel Methods: Support Vector Learning. Cambridge: MIT Press, 1998, 169-184

38. Joachims T. A support vector method for multivariate performance measures. In: Proceedings of the International Conference on Machine Learning. 2005, 377-384

39. Fan R E, Chen P H, Lin C J. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/cjlin/libsvm/

40. Liu T Y, Yang Y M, Wan H, Zeng H J, Chen Z, Ma W Y. Support vector machines classification with a very large-scale taxonomy. ACM SIGKDD Explorations Newsletter, 2005, 7(1): 36-43

41. Yang Y M, Pedersen J O. A comparative study on feature selection in text categorization. In: Proceedings of International Conference on Machine Learning. 1997, 187-196

42. Wu G, Chang E. Class-boundary alignment for imbalanced data set learning. In: Proceedings of International Conference on Data Mining, Workshop on Learning from Imbalanced Data Sets II. 2003, 1-8

43. Wu G, Chang E Y. Aligning boundary in kernel space for learning imbalanced data set. In: Proceedings of International Conference on Data Mining. 2004, 265-272

44. Wu G, Chang E Y. KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(6): 786-795

45. Kang P, Cho S. EUS SVMs: ensemble of under-sampled SVMs for data imbalance problems. Lecture Notes in Computer Science, 2006, 4232: 837-846

46. Liu Y, An A, Huang X. Boosting prediction accuracy on imbalanced data sets with SVM ensembles. Lecture Notes in Artificial Intelligence, 2006, 3918: 107-118

47. Vilarino F, Spyridonos P, Radeva P, Vitria J. Experiments with SVM and stratified sampling with an imbalanced problem: detection of intestinal contractions. Lecture Notes in Computer Science, 2005, 3687: 783-791

48. Wang B X, Japkowicz N. Boosting support vector machines for imbalanced data sets. Lecture Notes in Artificial Intelligence, 2008, 4994: 38-47

49. Abe N. Sampling approaches to learning from imbalanced data sets: active learning, cost sensitive learning and beyond. In: Proceedings of International Conference on Machine Learning, Workshop on Learning from Imbalanced Data Sets II. 2003

50. Ertekin S, Huang J, Bottou L, Giles L. Learning on the border: active learning in imbalanced data classification. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management. 2007, 127-136

51. Ertekin S, Huang J, Giles C L. Active learning for class imbalance problem. In: Proceedings of International SIGIR Conference on Research and Development in Information Retrieval. 2007, 823-824

52. Provost F. Machine learning from imbalanced data sets 101. In: Proceedings of American Association for Artificial Intelligence Workshop on Imbalanced Data Sets. 2000, 1-3

53. Lu B L, Wang X L, Utiyama M. Incorporating prior knowledge into learning by dividing training data. Frontiers of Computer Science in China, 2009, 3(1): 109-122

54. Lu B L, Ichikawa M. A Gaussian zero-crossing discriminant function for Min-Max modular neural networks. In: Proceedings of the 5th International Conference on Knowledge-Based Intelligent Information Engineering Systems and Allied Technologies. 2001, 298-302

55. Lu B L, Ichikawa M. Emergent on-line learning with a Gaussian zero-crossing discriminant function. In: Proceedings of IEEE/INNS International Joint Conference on Neural Networks. 2002, 2: 1263-1268

56. Lu B L, Li J. A Min-Max modular network with Gaussian zero-crossing function. In: Chen K, Wang L, eds. Trends in Neural Computation. Berlin: Springer, 2007, 285-313

57. Wang K A, Zhao H, Lu B L. Task decomposition using geometric relation for Min-Max modular SVMs. Lecture Notes in Computer Science, 2005, 3496: 887-892

58. Wen Y M, Lu B L, Zhao H. Equal clustering makes Min-Max modular support vector machine more efficient. In: Proceedings of the 12th International Conference on Neural Information Processing. 2005, 77-82

59. Cong C, Lu B L. Partition of sample space with perceptrons. Computer Simulation, 2008, 25(2): 96-99 (in Chinese)

60. Ma C, Lu B L, Utiyama M. Incorporating prior knowledge into task decomposition for large-scale patent classification. In: Proceedings of the 6th International Symposium on Neural Networks: Advances in Neural Networks, Part II. 2009, 784-793

61. Zhao H, Lu B L. A modular k-nearest neighbor classification method for massively parallel text categorization. Lecture Notes in Computer Science, 2004, 3314: 867-872

62. Zhao H, Lu B L. Improvement on response performance of Min-Max modular classifier by symmetric module selection. Lecture Notes in Computer Science, 2005, 3497: 39-44

63. Lu B L, Wang X L. A parallel and modular pattern classification framework for large-scale problems. In: Chen C H, ed. Handbook of Pattern Recognition and Computer Vision. 4th ed. Singapore: World Scientific, 2009, 725-746

64. Fall C J, Törcsvári A, Benzineb K, Karetka G. Automated categorization in the international patent classification. ACM SIGIR Forum, 2003, 37(1): 10-25

65. Fujii A, Iwayama M, Kando N. Introduction to the special issue on patent processing. Information Processing and Management, 2007, 43(5): 1149-1153

66. Chu X L, Ma C, Li J, Lu B L, Utiyama M, Isahara H. Large-scale patent classification with Min-Max modular support vector machines. In: Proceedings of IEEE/INNS International Joint Conference on Neural Networks. 2008, 3973-3980

67. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys, 2002, 34(1): 1-47

68. Cedano J, Aloy P, Pérez-Pons J A, Querol E. Relation between amino acid composition and cellular location of proteins. Journal of Molecular Biology, 1997, 266(3): 594-600

69. Chou K C, Shen H B. Review: recent progresses in protein subcellular location prediction. Analytical Biochemistry, 2007, 370(1): 1-16

70. Cai Y D, Chou K C. Predicting 22 protein localizations in budding yeast. Biochemical and Biophysical Research Communications, 2004, 323(2): 425-428

71. Yang Y, Lu B L. Prediction of protein subcellular multi-localization by using a Min-Max modular support vector machine. Advances in Computational Intelligence, Advances in Soft Computing, 2009, 116: 133-143

72. Zhang S, Xia X, Shen J, Zhou Y, Sun Z. DBMLoc: a database of proteins with multiple subcellular localizations. BMC Bioinformatics, 2008, 9(1): 127

73. Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genetics, 2000, 25(1): 25-29

74. Chou K C, Cai Y D. Predicting protein localization in budding yeast. Bioinformatics, 2005, 21(7): 944-950

75. Wang J Z, Du Z, Payattakool R, Yu P S, Chen C F. A new method to measure the semantic similarity of GO terms. Bioinformatics, 2007, 23(10): 1274-1281

76. Karypis G. CLUTO: A Clustering Toolkit. Technical Report 02-017. 2002

77. Huh W K, Falvo J V, Gerke L C, Carroll A S, Howson R W, Weissman J S, O'Shea E K. Global analysis of protein localization in budding yeast. Nature, 2003, 425(6959): 686-691
