Comparison of Racog and Racog-Rus for Classifying Imbalanced Data on Gradient Boosting and Naïve Bayes Performance

Rahmi Fadhilah, Heri Kuswanto, Dedy Dwi Prastyo, Dinda Ayu Safira, and Muhammad Yahya Matdoan

Journal of Modern Applied Statistical Methods ›› 2025, Vol. 24 ›› Issue (1) : 6

DOI: 10.56801/Jmasm.V24.i1.6

Abstract

This study examines how RACOG and RACOG-RUS resampling affect the performance of Gradient Boosting and Naïve Bayes classifiers in predicting water quality from imbalanced data. The dataset comprises 720 observations collected from January 2022 to December 2023. Gradient Boosting performed best with RACOG-RUS resampling combined with feature selection and numInstances set to 200, whereas Naïve Bayes performed best with RACOG-RUS resampling without feature selection and numInstances set to 300. RACOG resampling alone did not outperform RACOG-RUS for either classifier, because the data generated by RACOG do not balance the dataset as well as RACOG-RUS does. Hybrid sampling is therefore necessary when RACOG samples are used as the training dataset.
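The hybrid idea behind RACOG-RUS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, class labels, and toy counts are hypothetical, and the "synthetic" minority rows are placeholder copies standing in for whatever a RACOG oversampler would generate. It only shows why adding a fixed number of synthetic minority instances may leave the data imbalanced, and how random undersampling (RUS) of the majority class then evens out the class counts.

```python
import random
from collections import Counter


def random_undersample(rows, labels, majority_label, target_majority, seed=0):
    """Keep every non-majority row and a random subset of the majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    other_idx = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = min(target_majority, len(majority_idx))
    keep = sorted(other_idx + rng.sample(majority_idx, n_keep))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def hybrid_resample(rows, labels, synthetic_rows, minority_label,
                    majority_label, seed=0):
    """Append synthetic minority rows, then undersample the majority class
    down to the resulting minority count (a simple 1:1 target, assumed here)."""
    rows_aug = list(rows) + list(synthetic_rows)
    labels_aug = list(labels) + [minority_label] * len(synthetic_rows)
    n_minority = sum(1 for y in labels_aug if y == minority_label)
    return random_undersample(rows_aug, labels_aug, majority_label,
                              n_minority, seed)


# Toy data (hypothetical): 90 majority rows and 10 minority rows.
rows = [[float(i)] for i in range(100)]
labels = ["majority"] * 90 + ["minority"] * 10

# Placeholder for oversampler output: 20 copies of existing minority rows.
# A real RACOG run would generate new instances instead of copying.
synthetic_rows = [rows[90 + (i % 10)] for i in range(20)]

# Oversampling alone: 90 majority vs. 30 minority, still imbalanced.
oversampled_counts = Counter(labels + ["minority"] * len(synthetic_rows))

# Oversampling followed by RUS: 30 majority vs. 30 minority.
new_rows, new_labels = hybrid_resample(
    rows, labels, synthetic_rows, "minority", "majority", seed=42
)
balanced_counts = Counter(new_labels)
```

In this toy setting, the number of synthetic rows plays the role that numInstances plays in the abstract: if it is smaller than the gap between the classes, oversampling alone cannot close that gap, which is consistent with the reported advantage of the hybrid scheme.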

Keywords

gradient boosting / naive bayes / RACOG / RACOG-RUS

Cite this article

Rahmi Fadhilah, Heri Kuswanto, Dedy Dwi Prastyo, Dinda Ayu Safira, and Muhammad Yahya Matdoan. Comparison of Racog and Racog-Rus for Classifying Imbalanced Data on Gradient Boosting and Naïve Bayes Performance. Journal of Modern Applied Statistical Methods, 2025, 24(1): 6. DOI: 10.56801/Jmasm.V24.i1.6

Author Contributions

R.F.: conceptualization, methodology, software, writing—original draft preparation, visualization; H.K.: conceptualization, methodology, validation, supervision, writing—review and editing; D.D.P.: conceptualization, methodology, validation, supervision, writing—review and editing; D.A.S.: data curation, investigation, formal analysis; M.Y.M.: software support, validation, visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Indonesian Ministry of Education, Culture, Research, and Technology through the PMDSU Research Grant No. 038/E5/PG.02.00.PL/2024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. Due to privacy and ethical restrictions related to the data, public sharing is limited. However, the authors are committed to providing access to the data to qualified researchers upon reasonable request, ensuring compliance with confidentiality agreements and legal requirements.

Acknowledgments

The authors gratefully acknowledge the financial support from the Indonesian Ministry of Education, Culture, Research, and Technology through the PMDSU Research Grant No. 038/E5/PG.02.00.PL/2024. We also thank Institut Teknologi Sepuluh Nopember, Pattimura University, and the Ministry of Environment for their valuable support and collaboration throughout this research.

Conflicts of Interest

The authors declare no conflict of interest.
