Adaptive feature selection method for high-dimensional imbalanced data classification

Jianzhen WU , Zhen XUE , Liangliang ZHANG , Xu YANG

Journal of Measurement Science and Instrumentation, 2025, 16(4): 612-624. DOI: 10.62756/jmsi.1674-8042.2025059

Test and detection technology

Abstract

Data collected in fields such as cybersecurity and biomedicine often suffer from high dimensionality and class imbalance. To address the low classification accuracy for minority-class samples caused by the numerous irrelevant and redundant features in high-dimensional imbalanced data, we proposed a novel feature selection method named AMF-SGSK, based on an adaptive multi-filter and subspace-based gaining-sharing knowledge. Firstly, a balanced dataset was obtained by random under-sampling. Secondly, by combining each filter method's feature importance score with its AUC score, we proposed the concept of feature hardness to judge the importance of features, so that essential features could be selected adaptively. Finally, the optimal feature subset was obtained by gaining-sharing knowledge across multiple subspaces. This approach effectively achieved dimensionality reduction for high-dimensional imbalanced data. Experimental results on 30 benchmark imbalanced datasets showed that AMF-SGSK outperformed eight other commonly used algorithms, including BGWO and IG-SSO, in terms of F1-score, AUC, and G-mean. The mean F1-score, AUC, and G-mean of AMF-SGSK are 0.950, 0.967, and 0.965, respectively, the highest among all algorithms, and its mean G-mean exceeds those of IG-PSO, ReliefF-GWO, and BGOA by 3.72%, 11.12%, and 20.06%, respectively. Furthermore, the selected feature ratio is below 0.01 across the ten selected datasets, further demonstrating the proposed method's overall superiority over competing approaches. AMF-SGSK can adaptively remove irrelevant and redundant features and effectively improve the classification accuracy of high-dimensional imbalanced data, providing a scientific and technological reference for practical applications.
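The first two stages of the pipeline described above can be illustrated with a minimal pure-Python sketch. This is not the paper's exact formulation: the single-feature AUC via the Mann-Whitney rank statistic and the weighted `hardness` blend of a filter's importance score with that AUC are illustrative assumptions standing in for the adaptive multi-filter scoring.

```python
import random

def undersample(X, y, seed=0):
    """Randomly under-sample the majority class down to the minority size
    (binary labels 0/1 assumed), yielding a balanced dataset."""
    rng = random.Random(seed)
    pos = [i for i, c in enumerate(y) if c == 1]
    neg = [i for i, c in enumerate(y) if c == 0]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    keep = sorted(minor + rng.sample(major, len(minor)))
    return [X[i] for i in keep], [y[i] for i in keep]

def feature_auc(values, y):
    """Standalone AUC of one feature, computed as the Mann-Whitney
    rank statistic: fraction of (positive, negative) pairs the
    feature orders correctly, with ties counted as 0.5."""
    pos = [v for v, c in zip(values, y) if c == 1]
    neg = [v for v, c in zip(values, y) if c == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hardness(importance, auc, alpha=0.5):
    """Illustrative feature-hardness score: a weighted blend of a filter's
    importance score (assumed normalised to [0, 1]) and the feature's
    standalone AUC; higher means more essential."""
    return alpha * importance + (1 - alpha) * auc
```

Features whose hardness exceeds an adaptive threshold would then be passed to the gaining-sharing knowledge search over multiple subspaces to pick the final subset.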

Keywords

high-dimensional imbalanced data / adaptive feature selection / adaptive multi-filter / feature hardness / gaining sharing knowledge based algorithm / metaheuristic algorithm

Cite this article

Jianzhen WU, Zhen XUE, Liangliang ZHANG, Xu YANG. Adaptive feature selection method for high-dimensional imbalanced data classification. Journal of Measurement Science and Instrumentation, 2025, 16(4): 612-624. DOI: 10.62756/jmsi.1674-8042.2025059


References

[1] QIU Y, MA L, PRIYADARSHI R. Deep learning challenges and prospects in wireless sensor network deployment. Archives of Computational Methods in Engineering, 2024, 31(6): 3231-3254.
[2] LUO T, XIE J P, ZHANG B T, et al. An improved levy chaotic particle swarm optimization algorithm for energy-efficient cluster routing scheme in industrial wireless sensor networks. Expert Systems with Applications, 2024, 241: 122780.
[3] PRIYADARSHI R. Exploring machine learning solutions for overcoming challenges in IoT-based wireless sensor network routing: a comprehensive review. Wireless Networks, 2024, 30(4): 2647-2673.
[4] OUADERHMAN T, CHAMLAL H, JANANE F Z. A new filter-based gene selection approach in the DNA microarray domain. Expert Systems with Applications, 2024, 240: 122504.
[5] KAMALOV F, LEUNG H H. Outlier detection in high dimensional data. Journal of Information & Knowledge Management, 2020, 19(1): 2040013.
[6] CUI J Y, ZONG L S, XIE J H, et al. A novel multi-module integrated intrusion detection system for high-dimensional imbalanced data. Applied Intelligence, 2023, 53(1): 272-288.
[7] HE H B, GARCIA E A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284.
[8] THAI-NGHE N, GANTNER Z, SCHMIDT-THIEME L. Cost-sensitive learning methods for imbalanced data//The 2010 International Joint Conference on Neural Networks, July 18-23, 2010, Barcelona, Spain. New York: IEEE, 2010: 1-8.
[9] SAGI O, ROKACH L. Ensemble learning: a survey. WIREs Data Mining and Knowledge Discovery, 2018, 8(4): e1249.
[10] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
[11] HAN H, WANG W Y, MAO B H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning//Advances in Intelligent Computing. Berlin, Heidelberg: Springer, 2005: 878-887.
[12] HE H B, BAI Y, GARCIA E A, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning//2008 IEEE International Joint Conference on Neural Networks, June 1-8, 2008, Hong Kong, China. New York: IEEE, 2008: 1322-1328.
[13] RAMENTOL E, CABALLERO Y, BELLO R, et al. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information Systems, 2012, 33(2): 245-265.
[14] BARUA S, ISLAM M M, YAO X, et al. MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(2): 405-425.
[15] KHALID S, KHALIL T, NASREEN S. A survey of feature selection and feature extraction techniques in machine learning//2014 Science and Information Conference, August 27-29, 2014, London, UK. New York: IEEE, 2014: 372-378.
[16] SUN L, SI S S, DING W P, et al. TFSFB: two-stage feature selection via fusing fuzzy multi-neighborhood rough set with binary whale optimization for imbalanced data. Information Fusion, 2023, 95: 91-108.
[17] SONG X F, ZHANG Y, GONG D W, et al. A fast hybrid feature selection based on correlation-guided clustering and particle swarm optimization for high-dimensional data. IEEE Transactions on Cybernetics, 2022, 52(9): 9573-9586.
[18] LIU H, SETIONO R. Chi2: feature selection and discretization of numeric attributes//7th IEEE International Conference on Tools with Artificial Intelligence, November 5-8, 1995, Herndon, VA, USA. New York: IEEE, 1995: 388-391.
[19] SHANG C X, LI M, FENG S Z, et al. Feature selection via maximizing global information gain for text classification. Knowledge-Based Systems, 2013, 54: 298-309.
[20] PENG H C, LONG F H, DING C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(8): 1226-1238.
[21] ROBNIK-ŠIKONJA M, KONONENKO I. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 2003, 53(1): 23-69.
[22] DOKEROGLU T, DENIZ A, KIZILOZ H E. A comprehensive survey on recent metaheuristics for feature selection. Neurocomputing, 2022, 494: 269-296.
[23] WHITLEY D. A genetic algorithm tutorial. Statistics and Computing, 1994, 4(2): 65-85.
[24] EBERHART R C, SHI Y H. Particle swarm optimization: developments, applications and resources//2001 Congress on Evolutionary Computation, May 27-30, 2001, Seoul, Korea. New York: IEEE, 2001: 81-86.
[25] DORIGO M, BIRATTARI M, STUTZLE T. Ant colony optimization. IEEE Computational Intelligence Magazine, 2006, 1(4): 28-39.
[26] MIRJALILI S, MIRJALILI S M, LEWIS A. Grey wolf optimizer. Advances in Engineering Software, 2014, 69: 46-61.
[27] SAREMI S, MIRJALILI S, LEWIS A. Grasshopper optimisation algorithm: theory and application. Advances in Engineering Software, 2017, 105: 30-47.
[28] YIN Y H, JANG-JACCARD J, XU W, et al. IGRF-RFE: a hybrid feature selection method for MLP-based network intrusion detection on UNSW-NB15 dataset. Journal of Big Data, 2023, 10(1): 15.
[29] ZHAO B J, YANG D S, KARIMI H R, et al. Filter-wrapper combined feature selection and adaboost-weighted broad learning system for transformer fault diagnosis under imbalanced samples. Neurocomputing, 2023, 560: 126803.
[30] VOMMI A M, BATTULA T K. A hybrid filter-wrapper feature selection using Fuzzy KNN based on Bonferroni mean for medical datasets classification: a COVID-19 case study. Expert Systems with Applications, 2023, 218: 119612.
[31] XU Y H, YU Z W, CHEN C L P. Classifier ensemble based on multiview optimization for high-dimensional imbalanced data classification. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(1): 870-883.
[32] ZHANG C K, ZHOU Y, GUO J W, et al. Research on classification method of high-dimensional class-imbalanced datasets based on SVM. International Journal of Machine Learning and Cybernetics, 2019, 10(7): 1765-1778.
[33] AYDOGAN E K, OZMEN M, DELICE Y. CBR-PSO: cost-based rough particle swarm optimization approach for high-dimensional imbalanced problems. Neural Computing and Applications, 2019, 31(10): 6345-6363.
[34] ZHANG Y, WANG Y H, GONG D W, et al. Clustering-guided particle swarm feature selection algorithm for high-dimensional imbalanced data with missing values. IEEE Transactions on Evolutionary Computation, 2021, 26(4): 616-630.
[35] ALMOTAIRI K H. Gene selection for high-dimensional imbalanced biomedical data based on marine predators algorithm and evolutionary population dynamics. Arabian Journal for Science and Engineering, 2024, 49(3): 3935-3961.
[36] MOAYEDIKIA A, ONG K L, BOO Y L, et al. Feature selection for high dimensional imbalanced class data using harmony search. Engineering Applications of Artificial Intelligence, 2017, 57: 38-49.
[37] ABDULRAUF SHARIFAI G, ZAINOL Z. Feature selection for high-dimensional and imbalanced biomedical data based on robust correlation based redundancy and binary grasshopper optimization algorithm. Genes, 2020, 11(7): 717.
[38] SHARIFAI A G, ZAINOL Z B. Multiple filter-based rankers to guide hybrid grasshopper optimization algorithm and simulated annealing for feature selection with high dimensional multi-class imbalanced datasets. IEEE Access, 2021, 9: 74127-74142.
[39] SAHU B, PANIGRAHI A, ROUT S K, et al. Hybrid multiple filter embedded political optimizer for feature selection//2022 International Conference on Intelligent Controller and Computing for Smart Power, July 21-23, 2022, Hyderabad, India. New York: IEEE, 2022: 1-6.
[40] LIU Y, WANG Y Z, REN X G, et al. A classification method based on feature selection for imbalanced data. IEEE Access, 2019, 7: 81794-81807.
[41] KIM J, KANG J, SOHN M. Ensemble learning-based filter-centric hybrid feature selection framework for high-dimensional imbalanced data. Knowledge-Based Systems, 2021, 220: 106901.
[42] STÅHLE L, WOLD S. Analysis of variance (ANOVA). Chemometrics and Intelligent Laboratory Systems, 1989, 6(4): 259-272.
[43] SAIDI R, BOUAGUEL W, ESSOUSSI N. Hybrid feature selection method based on the genetic algorithm and pearson correlation coefficient//Machine Learning Paradigms: Theory and Application. Cham: Springer International Publishing, 2019: 3-24.
[44] SENLIOL B, GULGEZEN G, YU L, et al. Fast Correlation Based Filter (FCBF) with a different search strategy//2008 23rd International Symposium on Computer and Information Sciences, October 27-29, 2008, Istanbul, Turkey. New York: IEEE, 2008: 1-4.
[45] MOHAMED A W, HADI A A, MOHAMED A K. Gaining-sharing knowledge based algorithm for solving optimization problems: a novel nature-inspired algorithm. International Journal of Machine Learning and Cybernetics, 2020, 11(7): 1501-1529.
[46] COVER T, HART P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967, 13(1): 21-27.
[47] VANSCHOREN J, VAN RIJN J N, BISCHL B, et al. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 2014, 15(2): 49-60.
[48] FRANK A. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2010.
[49] LI J D, CHENG K W, WANG S H, et al. Feature selection: a data perspective. ACM Computing Surveys, 2017, 50(6): 1-45.
[50] WARDHANI N W S, ROCHAYANI M Y, IRIANY A, et al. Cross-validation metrics for evaluating classification performance on imbalanced data//2019 International Conference on Computer, Control, Informatics and its Applications, October 23-24, 2019, Tangerang, Indonesia. New York: IEEE, 2019: 14-18.
[51] EMARY E, ZAWBAA H M, HASSANIEN A E. Binary grey wolf optimization approaches for feature selection. Neurocomputing, 2016, 172: 371-381.
[52] MAFARJA M, ALJARAH I, FARIS H, et al. Binary grasshopper optimisation algorithm approaches for feature selection problems. Expert Systems with Applications, 2019, 117: 267-286.
[53] DASS S, MISTRY S, SARKAR P. Identification of promising biomarkers in cancer diagnosis using a hybrid model combining ReliefF and grey wolf optimization//International Conference on Communication and Intelligent Systems. Singapore: Springer Nature Singapore, 2022: 311-321.
[54] SUN L, KONG X L, XU J C, et al. A hybrid gene selection method based on ReliefF and ant colony optimization algorithm for tumor classification. Scientific Reports, 2019, 9(1): 8978.
[55] YİĞİT F, BAYKAN Ö K. A new feature selection method for text categorization based on information gain and particle swarm optimization//2014 IEEE 3rd International Conference on Cloud Computing and Intelligence Systems, November 27-29, 2014, Shenzhen, China. New York: IEEE, 2014: 523-529.
[56] YE C C, PAN J L, JIN Q. An improved SSO algorithm for cyber-enabled tumor risk analysis based on gene selection. Future Generation Computer Systems, 2019, 92: 407-418.
