Abstract
Aim: This study aims to investigate and apply effective machine learning techniques for the early detection and precise diagnosis of breast cancer. The analysis is conducted using various breast cancer datasets, including Breast Cancer Wisconsin, Breast Cancer Diagnosis, NKI Breast Cancer, and SEER Breast Cancer datasets. The primary focus is on identifying key features and utilizing preprocessing methods to enhance classification accuracy.
Methods: The datasets undergo several preprocessing steps: label encoding of categorical variables, linear-regression imputation of missing values, and robust scaling (Robust Scaler) to normalize the features. To address class imbalance, Tomek Link SMOTE is applied to rebalance the classes. Significant features are selected through L2 Ridge regularization, which helps pinpoint the most important predictors of breast cancer. Eight machine learning models are applied to the classification tasks: decision tree, random forest, support vector machine (SVM), neural network, k-nearest neighbors, naïve Bayes, extreme gradient boosting (XGBoost), and AdaBoost. Model performance is assessed using accuracy, precision, recall, F1-score, and the Kappa statistic, and is further evaluated using the receiver operating characteristic curve and the precision-recall curve.
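As an illustration of the scalar metrics named above, the following is a minimal sketch of how accuracy, precision, recall, F1-score, and Cohen's Kappa can be computed from binary predictions. It is not the authors' code: the function name `binary_metrics` and the 0/1 label encoding (1 = positive class) are assumptions made for this example.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, F1 and Cohen's kappa for 0/1 labels.

    Hypothetical helper for illustration; assumes 1 marks the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Agreement expected by chance, from the marginal totals of the
    # confusion matrix (predicted vs. actual, for each class).
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


if __name__ == "__main__":
    # Toy labels, invented for the example (not data from the study).
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(y_true, y_pred))
```

Kappa corrects accuracy for the agreement expected by chance, which is why it is often reported alongside accuracy on imbalanced data.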
Results: The XGBoost model achieved the best performance on both the Breast Cancer Wisconsin and Breast Cancer Diagnosis datasets. The SVM model reached 100% accuracy on the NKI Breast Cancer dataset, while the random forest model performed best on the SEER Breast Cancer dataset. Feature selection through L2 Ridge regularization was crucial in enhancing the performance of these models.
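The abstract does not spell out how L2 Ridge regularization is turned into a feature ranking. One common approach, shown here only as a hedged sketch, is to fit a ridge-penalized linear model on standardized features and rank the features by absolute coefficient size. The function names `ridge_fit` and `rank_features`, the penalty strength `alpha`, and the gradient-descent settings are all assumptions for this example, not the authors' procedure.

```python
def ridge_fit(X, y, alpha=0.1, lr=0.05, epochs=2000):
    """Fit a linear model with an L2 penalty by batch gradient descent.

    Minimizes mean squared error + alpha * sum(w_j ** 2).
    The intercept b is not penalized. Features are assumed to be
    standardized already, so coefficient sizes are comparable.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        # Residuals of the current model on every sample.
        errs = [sum(wj * xj for wj, xj in zip(w, row)) + b - t
                for row, t in zip(X, y)]
        # Gradient of the penalized loss with respect to each weight.
        grad_w = [(2.0 / n) * sum(e * row[j] for e, row in zip(errs, X))
                  + 2.0 * alpha * w[j]
                  for j in range(d)]
        grad_b = (2.0 / n) * sum(errs)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rank_features(X, y, alpha=0.1):
    """Return feature indices sorted by descending |ridge coefficient|."""
    w, _ = ridge_fit(X, y, alpha=alpha)
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)


if __name__ == "__main__":
    # Toy data invented for the example: feature 0 tracks the label,
    # feature 1 is uncorrelated with it.
    X = [[1.0, 0.5], [0.8, -0.5], [1.2, 0.5], [0.9, -0.5],
         [-1.0, 0.5], [-0.8, -0.5], [-1.2, 0.5], [-0.9, -0.5]]
    y = [1, 1, 1, 1, 0, 0, 0, 0]
    print(rank_features(X, y))
```

Because the L2 penalty shrinks coefficients without setting them exactly to zero, a ranking like this still needs a cut-off (for example, keeping the top k features) before it acts as feature selection.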
Conclusions: This work emphasizes the critical role of machine learning in improving breast cancer detection. By combining preprocessing techniques with classification models, the study identifies significant features and boosts model performance. These findings contribute to the development of more accurate diagnostic tools, which may ultimately improve patient outcomes.
Keywords
Breast cancer / machine learning / feature selection / data preprocessing / Tomek Link SMOTE / L2 ridge regularization
Cite this article
Kandhasamy Premalatha, Duraisamy Prabha Devi, Kandhasamy Sivakumar.
Machine learning framework for breast cancer detection with feature selection with L2 ridge regularization: insights from multiple datasets.
Journal of Translational Genetics and Genomics, 2025, 9(1): 11-34 DOI:10.20517/jtgg.2024.82