Evolutionary under-sampling based bagging ensemble method for imbalanced data classification

Bo SUN; Haiyan CHEN; Jiandong WANG; Hua XIE

doi:10.1007/s11704-016-5306-z

Front. Comput. Sci., 2018, Vol. 12, Issue 2: 331-350. DOI: 10.1007/s11704-016-5306-z

RESEARCH ARTICLE

Abstract

In class-imbalanced learning, traditional machine learning algorithms that optimize overall accuracy tend to classify poorly, especially on the minority class, which is usually the class of most interest. Many effective approaches have been proposed to solve this problem. Among them, bagging ensemble methods that integrate under-sampling techniques have demonstrated better performance than several alternatives, including bagging ensemble methods integrated with over-sampling techniques and cost-sensitive methods. Although these under-sampling techniques promote diversity among the generated base classifiers through random partitioning or sampling of the majority class, they take no measures to ensure the classification performance of each individual base classifier, which limits the achievable ensemble performance. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the best majority class subset for training a well-performing nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method, EUS-Bag, by designing a new fitness function that considers three factors to make EUS better suited to the framework. With this fitness function, EUS-Bag can generate a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean, and AUC all demonstrate its superior performance.
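As a rough illustration of the under-sampling bagging framework that the abstract builds on, the sketch below trains one nearest neighbor classifier per balanced bag and combines them by majority vote. It uses plain random under-sampling, not EUS: the abstract does not specify the three factors of the EUS-Bag fitness function, so no attempt is made to reproduce it. All function names, the toy data, and the tie-breaking rule are assumptions made for this example, and the geometric mean is taken in its usual form as the square root of recall times specificity.

```python
import math
import random

def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-NN)."""
    return min(train, key=lambda item: math.dist(item[0], x))[1]

def undersampled_bags(majority, minority, n_bags, rng):
    """Build n_bags balanced training sets: every minority point plus an
    equally sized random subset of the majority class (random
    under-sampling, standing in for the evolutionary search)."""
    k = min(len(minority), len(majority))
    return [rng.sample(majority, k) + minority for _ in range(n_bags)]

def ensemble_predict(bags, x):
    """Majority vote over one 1-NN classifier per bag.
    Labels are 0 (majority) and 1 (minority); ties go to the minority class
    (an assumption of this sketch)."""
    votes = sum(nearest_neighbor_predict(bag, x) for bag in bags)
    return 1 if 2 * votes >= len(bags) else 0

def recall_and_gmean(y_true, y_pred):
    """Recall on the minority class and the geometric mean of recall and
    specificity. Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    recall = tp / positives
    specificity = tn / negatives
    return recall, math.sqrt(recall * specificity)

# Toy two-class imbalanced data (hypothetical): 36 majority points on a
# small grid near the origin, 4 minority points near (3, 3).
majority = [((i * 0.1, j * 0.1), 0) for i in range(6) for j in range(6)]
minority = [((3.0, 3.0), 1), ((3.1, 3.0), 1), ((3.0, 3.1), 1), ((3.1, 3.1), 1)]

rng = random.Random(0)
bags = undersampled_bags(majority, minority, n_bags=5, rng=rng)

test_points = [(0.2, 0.2), (3.05, 3.05)]
test_labels = [0, 1]
predictions = [ensemble_predict(bags, x) for x in test_points]
recall, gmean = recall_and_gmean(test_labels, predictions)
```

Each bag keeps every minority example and discards most of the majority class, so diversity among the base classifiers comes only from which majority points are drawn. That is the gap the abstract points to: nothing in this random scheme checks how well each individual classifier performs.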

Keywords

class imbalanced problem / under-sampling / bagging / evolutionary under-sampling / ensemble learning / machine learning / data mining

Cite this article

Bo SUN, Haiyan CHEN, Jiandong WANG, Hua XIE. Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Front. Comput. Sci., 2018, 12(2): 331-350. DOI: 10.1007/s11704-016-5306-z

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature
