Benchmarking binary classification models on data sets with different degrees of imbalance
Ligang ZHOU, Kin Keung LAI
In practice, many problems are binary classification problems, such as credit risk assessment and medical testing to determine whether a patient has a certain disease. However, different problems have different characteristics that affect their difficulty, and one important characteristic is the degree of imbalance between the two classes in the data set. Are the commonly used binary classification methods still feasible for data sets with different degrees of imbalance? In this study, various binary classification models, including traditional statistical methods and newly emerged methods from artificial intelligence, such as linear regression, discriminant analysis, decision trees, neural networks, and support vector machines, are reviewed, and their performance in terms of classification accuracy and area under the Receiver Operating Characteristic (ROC) curve is tested and compared on fourteen data sets with different degrees of imbalance. The results help to select appropriate methods for problems with different degrees of imbalance.
binary classification / area under Receiver Operating Characteristic (ROC) curve / classification accuracy / degrees of imbalance
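As a minimal illustrative sketch (not taken from the paper), the following Python snippet shows how the two evaluation measures discussed in the abstract, classification accuracy and area under the ROC curve, can be computed for classifiers on an imbalanced data set using scikit-learn. The synthetic 90/10 class ratio and the choice of logistic regression and a decision tree are assumptions for illustration only, not the data sets or model configurations used in the study.

# Illustrative sketch only: compares accuracy and AUC for two classifiers
# on a synthetic data set with an assumed 90/10 class imbalance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic imbalanced data set (not one of the fourteen used in the paper).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    # AUC is computed from the predicted probability of the positive (minority) class.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy = {acc:.3f}, AUC = {auc:.3f}")

On highly imbalanced data, accuracy can stay high for a model that rarely predicts the minority class, while AUC reflects how well the model ranks positives above negatives, which is why the two measures are compared.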