Abstract
Achieving a higher true positive rate while lowering the false positive rate remains a major challenge for the imbalanced learning community. This work combines the penalized empirical likelihood method, the lower bound algorithm, and the Nyström method, and applies them together with the kernel method to the density ratio model. The resulting classifier, the density ratio classifier (DRC), combines kernelization, regularization, efficient implementation, and threshold moving, all of which are critical to making DRC an effective and powerful method for difficult imbalance problems. Compared with other methods, DRC is competitive in that it is widely applicable and simple to use, requiring no additional imbalance-handling techniques. The convergence rate of the estimate of the log density ratio is also discussed. Numerical results show that DRC outperforms other methods in AUC and G-mean score.
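To make two of the terms above concrete, the following is a minimal sketch of threshold moving evaluated by the G-mean score (the geometric mean of the true positive rate and the true negative rate). It is not the DRC implementation: the function names `g_mean` and `move_threshold`, the toy labels and scores, and the choice of picking the cut-off that maximizes G-mean on the same data are all assumptions made for illustration. The scores stand in for an estimated log density ratio, which the sketch does not compute.

```python
import math


def g_mean(labels, preds):
    """Geometric mean of the true positive rate and the true negative rate."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return math.sqrt((tp / pos) * (tn / neg))


def move_threshold(labels, scores):
    """Return the (threshold, G-mean) pair with the largest G-mean.

    Every observed score is tried as a cut-off; a point is predicted
    positive when its score is greater than or equal to the cut-off.
    """
    best_t, best_g = None, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        g = g_mean(labels, preds)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g


# Hypothetical toy data: 2 minority points (label 1), 8 majority points (label 0).
# The scores play the role of an estimated log density ratio.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.4, -0.3, -0.5, -1.2, -0.9, -2.0, -1.5, -0.8, -1.1, -0.2]

# Fixed cut-off at 0, i.e. where the two estimated densities are equal.
default_preds = [1 if s >= 0.0 else 0 for s in scores]
print("G-mean at cut-off 0:", g_mean(labels, default_preds))

# Cut-off moved to the value that maximizes G-mean on this toy sample.
print("moved (cut-off, G-mean):", move_threshold(labels, scores))
```

On this toy sample the fixed cut-off misses one of the two minority points, while the moved cut-off recovers both at the cost of one false positive, which raises the G-mean. In practice the cut-off would be chosen on held-out data rather than on the sample used for evaluation.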
Keywords
Classifier / Density ratio model / Imbalance problems / Kernel method / ROC curve
Cite this article
Junjun Li, Wenquan Cui.
A New Classifier for Imbalanced Data Based on a Generalized Density Ratio Model.
Communications in Mathematics and Statistics, 2023, 11(2): 369-401. DOI: 10.1007/s40304-021-00254-7
Funding
Innovative Research Group Project of the National Natural Science Foundation of China (71873128)