Transfer synthetic over-sampling for class-imbalance learning with limited minority class data

Xu-Ying LIU , Sheng-Tao WANG , Min-Ling ZHANG

Front. Comput. Sci., 2019, Vol. 13, Issue (5): 996-1009. DOI: 10.1007/s11704-018-7182-1
RESEARCH ARTICLE


Abstract

The problem of limited minority class data arises in many class-imbalanced applications but has received little attention. Synthetic over-sampling, a popular family of class-imbalance learning methods, can introduce considerable noise when the minority class has limited data, because the synthetic samples are not i.i.d. samples of the minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or by generating samples that are more consistent with the ground-truth data distribution, but their assumptions about the true noise or the ground-truth distribution may not hold. To adapt synthetic sampling to the problem of limited minority class data, the proposed Traso framework treats synthetic minority class samples as an additional data source and uses transfer learning to transfer knowledge from them to the minority class. As an implementation, the TrasoBoost method first generates synthetic samples to balance the class sizes. Then, in each boosting iteration, a misclassified synthetic sample has its weight decreased and a misclassified original instance has its weight increased, while the weights of correctly classified instances remain unchanged. Misclassified synthetic samples are potential noise and thus have less influence in subsequent iterations. In addition, the weights of minority class instances change more than those of majority class instances, so that minority class instances become more influential. Only the original data are used to estimate the error rate, so that the estimate is not affected by noise in the synthetic samples. Finally, since the synthetic samples are highly related to the minority class, all of the weak learners are aggregated for prediction. Experimental results show that TrasoBoost outperforms many popular class-imbalance learning methods.
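The per-iteration weight update described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual update formulas, so the function names and the factors `shrink`, `grow`, and `minority_exponent` below are hypothetical placeholders chosen for illustration only; this shows the direction of each update, not the authors' TrasoBoost implementation.

```python
import numpy as np


def update_weights(weights, misclassified, is_synthetic, is_minority,
                   shrink=0.5, grow=2.0, minority_exponent=2.0):
    """One illustrative weight update in the spirit of the abstract.

    shrink, grow and minority_exponent are assumed values, not the
    paper's formulas.
    """
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)
    is_synthetic = np.asarray(is_synthetic, dtype=bool)
    is_minority = np.asarray(is_minority, dtype=bool)

    # Misclassified synthetic samples are potential noise: shrink them.
    # Misclassified original samples are hard cases: grow them.
    factor = np.where(is_synthetic, shrink, grow)
    # Minority class instances change more strongly than majority ones.
    factor = factor ** np.where(is_minority, minority_exponent, 1.0)
    # Correctly classified samples keep their weight unchanged.
    factor = np.where(misclassified, factor, 1.0)
    return weights * factor


def original_error_rate(weights, misclassified, is_synthetic):
    """Weighted error rate estimated on original (non-synthetic) data only."""
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)
    original = ~np.asarray(is_synthetic, dtype=bool)
    return float(np.sum(weights[original] * misclassified[original])
                 / np.sum(weights[original]))


# Four samples: synthetic minority, original minority,
# original majority, original minority.
w = [1.0, 1.0, 1.0, 1.0]
wrong = [True, True, True, False]
synthetic = [True, False, False, False]
minority = [True, True, False, True]

new_w = update_weights(w, wrong, synthetic, minority)
err = original_error_rate(w, wrong, synthetic)
```

With these assumed factors, the misclassified synthetic sample loses weight, the misclassified original minority instance gains more weight than the misclassified majority instance, and the correctly classified instance is left alone. The error rate ignores the synthetic sample entirely.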

Keywords

machine learning / data mining / class imbalance / over-sampling / boosting / transfer learning

Cite this article

Xu-Ying LIU, Sheng-Tao WANG, Min-Ling ZHANG. Transfer synthetic over-sampling for class-imbalance learning with limited minority class data. Front. Comput. Sci., 2019, 13(5): 996-1009. DOI: 10.1007/s11704-018-7182-1

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature
