Machine learning framework for breast cancer detection with feature selection with L2 ridge regularization: insights from multiple datasets

Kandhasamy Premalatha, Duraisamy Prabha Devi, Kandhasamy Sivakumar

Journal of Translational Genetics and Genomics, 2025, Vol. 9, Issue 1: 11-34.

DOI: 10.20517/jtgg.2024.82
review-article

Abstract

Aim: This study investigates and applies machine learning techniques for the early detection and precise diagnosis of breast cancer. The analysis uses several breast cancer datasets, including the Breast Cancer Wisconsin, Breast Cancer Diagnosis, NKI Breast Cancer, and SEER Breast Cancer datasets. The primary focus is on identifying key features and using preprocessing methods to improve classification accuracy.

Methods: The datasets undergo several preprocessing steps: label encoding for categorical variables, linear regression to impute missing values, and robust scaling (RobustScaler) to normalize the features. To address class imbalance, Tomek Link SMOTE is employed to improve dataset representation. Significant features are selected through L2 Ridge regularization, which helps pinpoint the most important predictors of breast cancer. A range of machine learning models, including decision tree, random forest, support vector machine (SVM), neural network, K-nearest neighbor, naïve Bayes, extreme gradient boosting (XGBoost), and AdaBoost, are applied to the classification tasks. Model performance is assessed using accuracy, precision, recall, F1-score, and the Kappa statistic, and is further evaluated using the receiver operating characteristic curve and the precision-recall curve.
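
As a rough illustration of the ridge-based selection step described above, the following pure-Python sketch fits ridge coefficients in closed form on a toy dataset and keeps the features whose absolute weight is at least the mean absolute weight. The feature names, the data, the penalty alpha = 1.0, and the mean-magnitude threshold are all assumptions made for illustration; they are not taken from the study, and the study's imputation, scaling, and resampling steps are not reproduced here.

```python
# Illustrative sketch only: ridge (L2) coefficients used to rank features.
# Data, feature names, alpha, and the threshold rule are hypothetical.


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge fit: w = (X^T X + alpha * I)^-1 X^T (y - mean(y)).

    Assumes the columns of X are already centered and on a comparable scale.
    """
    n_features = len(X[0])
    y_mean = sum(y) / len(y)
    y_centered = [v - y_mean for v in y]
    gram = [
        [sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
         for j in range(n_features)]
        for i in range(n_features)
    ]
    xty = [sum(row[i] * yc for row, yc in zip(X, y_centered))
           for i in range(n_features)]
    return solve_linear_system(gram, xty)


def select_features(names, weights):
    """Keep features whose |weight| is at least the mean |weight|."""
    magnitudes = [abs(w) for w in weights]
    threshold = sum(magnitudes) / len(magnitudes)
    return [name for name, mag in zip(names, magnitudes) if mag >= threshold]


# Toy data: columns are already centered (+1/-1 codes with zero mean).
feature_names = ["strong_marker", "weak_marker", "noise"]
X = [
    [1, 1, 1],
    [1, 1, -1],
    [1, 1, 1],
    [1, -1, -1],
    [-1, -1, 1],
    [-1, 1, -1],
    [-1, -1, 1],
    [-1, -1, -1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = malignant, 0 = benign (hypothetical labels)

weights = ridge_coefficients(X, y, alpha=1.0)
selected = select_features(feature_names, weights)
print(dict(zip(feature_names, weights)))
print(selected)
```

On this toy data the sketch keeps only strong_marker, since the weak and noise features fall below the mean absolute weight. In practice a library implementation such as scikit-learn's Ridge would likely replace the hand-written solver, and the threshold would be tuned per dataset.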

Results: The XGBoost model achieved the best performance on both the Breast Cancer Wisconsin and Breast Cancer Diagnosis datasets. The SVM model reached 100% accuracy on the NKI Breast Cancer dataset, while the random forest model performed best on the SEER Breast Cancer dataset. Feature selection through L2 Ridge regularization was crucial in enhancing the performance of these models.

Conclusions: This work emphasizes the critical role of machine learning in improving breast cancer detection. By applying a combination of preprocessing techniques and classification models, the study successfully identifies significant features and boosts model performance. These findings contribute to the development of more accurate diagnostic tools, ultimately enhancing patient outcomes.

Keywords

Breast cancer / machine learning / feature selection / data preprocessing / Tomek Link SMOTE / L2 ridge regularization

Cite this article

Kandhasamy Premalatha, Duraisamy Prabha Devi, Kandhasamy Sivakumar. Machine learning framework for breast cancer detection with feature selection with L2 ridge regularization: insights from multiple datasets. Journal of Translational Genetics and Genomics, 2025, 9(1): 11-34. DOI: 10.20517/jtgg.2024.82
