A Hybrid Evolutionary Under-sampling Method for Handling the Class Imbalance Problem with Overlap in Credit Classification

Ping Gong, Junguang Gao, Li Wang

Journal of Systems Science and Systems Engineering, 2022, Vol. 31, Issue 6: 728-752.

DOI: 10.1007/s11518-022-5545-5

Abstract

Credit risk assessment is an important risk management task for financial institutions. Machine learning-based approaches have made promising progress in credit risk assessment by treating it as an imbalanced binary classification task. However, few efforts have been made to simultaneously deal with the class overlap problem that accompanies the imbalance. To this end, this study proposes a Tomek link and genetic algorithm (GA)-based under-sampling framework (TEUS) that addresses the class imbalance and overlap issues in binary credit classification by eliminating majority class instances while considering multi-perspective factors. First, TEUS determines the boundary majority instances with Tomek links, then takes the distance from each majority instance to its nearest boundary as the radius and assigns the density of opposite-class samples within that radius as the overlap potential of that majority instance. Second, TEUS weighs each non-borderline majority instance based on its information contribution in estimating class labels. After partitioning the non-borderline majority instances into subgroups according to overlap potential and information contribution, TEUS applies the GA to select samples from the subgroups and merges them with the minority samples into a new training set. The design of the GA fitness function and the grouping of the non-borderline majority instances not only trade off the multi-perspective characteristics of instances but also help reduce the computational complexity of the sampling optimization search. Numerical experiments on real-world credit data sets demonstrate the effectiveness of the proposed TEUS.
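
The first stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and in particular the reading of "density" as the fraction of minority samples that fall inside the radius are assumptions made here for illustration. A Tomek link is taken in its standard sense, a pair of mutual nearest neighbours with different class labels.

```python
import math

def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def overlap_potential(X, y, majority=0):
    """Map each non-borderline majority index to an illustrative overlap score."""
    links = tomek_links(X, y)
    # Boundary majority instances: majority members of any Tomek link.
    boundary = {k for pair in links for k in pair if y[k] == majority}
    minority = [k for k in range(len(X)) if y[k] != majority]
    if not boundary or not minority:
        return {}
    scores = {}
    for i in range(len(X)):
        if y[i] != majority or i in boundary:
            continue
        # Radius: distance to the nearest boundary majority instance.
        radius = min(math.dist(X[i], X[b]) for b in boundary)
        inside = sum(math.dist(X[i], X[m]) <= radius for m in minority)
        # Assumption: density = share of minority samples within the radius.
        scores[i] = inside / len(minority)
    return scores

# Hypothetical toy data: label 0 is the majority class, label 1 the minority.
X = [(0, 0), (2, 0), (4, 0), (4, 0.5), (2, 1.5), (2, 2.5), (7, 0)]
y = [0, 0, 0, 1, 1, 1, 1]

print(tomek_links(X, y))
print(overlap_potential(X, y))
```

The later stages (information-contribution weighting, subgroup partitioning, and GA-based selection) depend on definitions that the abstract does not give, so they are left out of this sketch.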

Keywords

Imbalance classification / credit classification / class overlap / evolutionary under-sampling / genetic algorithm

Cite this article

Ping Gong, Junguang Gao, Li Wang. A Hybrid Evolutionary Under-sampling Method for Handling the Class Imbalance Problem with Overlap in Credit Classification. Journal of Systems Science and Systems Engineering, 2022, 31(6): 728-752. DOI: 10.1007/s11518-022-5545-5


