1. Introduction
Stroke is a devastating global health burden and remains one of the leading causes of death and long-term disability across all age groups [1, 2]. The World Health Organization estimates that over 15 million people experience a stroke each year; approximately 5 million die and 5 million are left with permanent disability [3]. A major challenge in effective stroke management is the significant risk of recurrence. Epidemiological evidence indicates that about 5.7–51.3% of patients experience a second stroke within the first year after the initial event, and the risk can persist for years [4]. Recurrent stroke often results in more severe neurological impairment, increased healthcare costs, and a significant reduction in quality of life for patients and their families [5]. Therefore, early and accurate identification of individuals at high risk of recurrence is not merely a clinical priority but also a critical public health need, enabling individualized secondary prevention strategies to mitigate this risk.
Traditional approaches for predicting the risk of stroke recurrence, such as the Essen Stroke Risk Score (ESRS), the Stroke Prognostic Instrument (SPI), and the ABCD² (Age, Blood pressure, Clinical features, Duration of symptoms, Diabetes) score, are widely used in routine clinical care [6]. These models generally rely on a limited set of readily available clinical variables, including age, history of hypertension, diabetes mellitus, atrial fibrillation, and a previous transient ischemic attack (TIA) [7]. While they provide a convenient approach to risk stratification, their predictive performance is often moderate, with validation studies demonstrating area under the curve (AUC) values of 0.6 to 0.7 [8]. This modest accuracy reflects, in part, the limited ability of these strategies to capture the complex, multifactorial biology of stroke, which involves interactions between clinical features, biochemical pathways, and structural brain changes. Moreover, many of these models do not incorporate detailed neuroimaging information that can provide insights into the severity and anatomical distribution of cerebral damage, both of which are important determinants of recurrence risk.
In recent years, machine learning (ML) has revolutionized various fields of medicine, including diagnostic imaging, prognostic prediction modeling, and assessment of treatment response [9]. By processing high-dimensional data, identifying non-linear relationships, and extracting complex patterns from large datasets, ML approaches offer a promising alternative to traditional statistical methods for predicting stroke recurrence [10]. Unlike conventional approaches, ML models can integrate diverse data sources, including routine clinical variables, laboratory results, and imaging-derived features, enabling the development of more comprehensive and more accurate prediction tools [11].
Neuroimaging, in particular, holds significant potential for enhancing the prediction of recurrent stroke risk. Computed tomography (CT) and magnetic resonance imaging (MRI) can characterize infarct size and location and detect associated pathologies such as leukoaraiosis, cerebral microbleeds, and carotid artery stenosis [12]. These imaging features can reflect the underlying vascular pathology, the severity of cerebral ischemia, and the burden of silent cerebrovascular disease, all of which are strongly linked to stroke recurrence. For example, larger infarct sizes have consistently been associated with a higher recurrence risk [13], likely indicating more extensive vascular injury and a greater likelihood of unstable atherosclerotic plaques. Similarly, leukoaraiosis, a marker of cerebral small-vessel disease, has been established as an independent predictor of recurrent vascular events [14].
Despite growing interest in applying ML in stroke research, few studies have systematically compared ML algorithms for predicting stroke recurrence using a combination of routine clinical variables and imaging features. Most published studies have assessed only a single algorithm or have used one data modality alone (e.g., clinical data without imaging, or imaging without detailed clinical data), which limits our understanding of which algorithm and which data integration approach yields the best predictive performance. Additionally, prioritizing and interpreting the most influential predictors of recurrence within an integrated dataset remains crucial, both to enhance model transparency and to generate mechanistic insights that could inform the development of more effective secondary preventive strategies.
Therefore, this study aims to address these gaps by evaluating the performance of four commonly used ML approaches: logistic regression, random forest, support vector machine (SVM), and extreme gradient boosting (XGBoost). Using an integrated dataset that combines routine clinical data with detailed imaging features, the study seeks to determine which algorithm achieves the highest predictive performance for stroke recurrence. Furthermore, the study will identify the most influential predictors of recurrence within the integrated dataset and assess the generalizability of the optimal model across clinically relevant subgroups, such as patients with cortical versus subcortical infarcts. Overall, the findings may support the development of more accurate and clinically useful tools for recurrence risk stratification, enabling more individualized secondary prevention and improved patient outcomes.
2. Methods
2.1 Study Population
This study enrolled 350 patients with ischemic stroke from the Department of Neurology, The Fifth People’s Hospital of Jinan, China, between January 2018 and December 2021. Inclusion criteria were as follows: (1) diagnosis consistent with the Chinese Stroke Association guidelines for clinical management of ischaemic cerebrovascular diseases (executive summary and 2023 update) [15]; (2) first-ever ischemic stroke confirmed by CT or MRI; and (3) availability of complete clinical and imaging data. Patients were excluded if they had: (1) hemorrhagic stroke; (2) stroke secondary to trauma, tumor, or other non-atherosclerotic causes; or (3) severe cognitive impairment or other conditions preventing completion of follow-up.
Patients were categorized into three groups based on the admission period: Group A (January 2018–December 2019), Group B (January 2020–June 2021), and Group C (July 2021–December 2021). The intervals are non-uniform because the hospital transitioned to a digital medical record system in the later study phase (post–June 2021), which substantially improved the efficiency of patient identification and research recruitment and accounts for the higher monthly enrollment rate in Group C. To ensure balanced sample sizes and baseline characteristics across groups (all p > 0.05) while maintaining the same inclusion criteria, longer intervals were used for Groups A and B (pre-digitalization) to accumulate adequate patients, and a shorter interval was applied for Group C (post-digitalization) to avoid over-recruitment. The primary outcome was stroke recurrence, defined as a new ischemic stroke event confirmed by imaging within one year after the first stroke. A 1-year follow-up was selected because the risk of stroke recurrence is highest during the first year after the initial event, making it a critical window for intensified secondary prevention [16].
2.2 Data Collection
Two categories of variables, routine clinical data and imaging features, were collected for each participant. Routine clinical variables included demographic characteristics (age, gender), comorbidities (hypertension, diabetes, atrial fibrillation, coronary heart disease), laboratory results (fasting blood glucose, total cholesterol, low-density lipoprotein cholesterol, creatinine), and treatment (statin use, and antiplatelet therapy recorded as a binary variable without specifying the agent or combination regimen). Information on formal anticoagulation (e.g., warfarin or direct oral anticoagulants) was not consistently available and was therefore excluded from the analysis. Demographic factors and key comorbidities (hypertension, diabetes, atrial fibrillation, and coronary heart disease) were selected because they are well-established clinical determinants of stroke recurrence.
Imaging features included infarct size (cm², measured by CT/MRI), infarct location (cortical, subcortical, or posterior circulation), severity of leukoaraiosis (mild, moderate, severe), and carotid artery stenosis (≥50% or not, assessed using ultrasound).
2.3 Machine Learning Models
Four ML algorithms were selected for model construction: (i) logistic regression (LR), a linear classifier that models the log-odds of binary outcomes, incorporating L1 regularization to reduce overfitting and support feature selection [17]; (ii) random forest (RF), an ensemble approach that combines multiple decision trees, using bootstrap resampling and random feature selection to enhance robustness and reduce variance [18]; (iii) support vector machine (SVM), a margin-based classifier that identifies an optimal hyperplane to separate classes, using a radial basis function kernel to capture non-linear associations [19]; and (iv) XGBoost, a gradient-boosting framework that builds sequential trees with regularization to enhance generalization and minimize prediction error [20].
Feature importance was calculated from each model’s internal metric, scoring features based on their average gain across all splits in which they contributed. For benchmarking against traditional risk stratification, the Essen Stroke Risk Score (ESRS) was also calculated for each patient.
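The relative importance scores reported in the Results (top predictor = 100.0) are consistent with rescaling each feature's average gain against the largest value. The sketch below illustrates that rescaling under this assumption; the function name and the raw gain values are hypothetical and were chosen only so that the output mirrors the reported scale.

```python
def scale_importance(avg_gain):
    """Rescale average-gain scores so that the top feature equals 100."""
    top = max(avg_gain.values())
    if top <= 0:
        raise ValueError("at least one feature must have a positive gain")
    scaled = {name: round(100.0 * gain / top, 1) for name, gain in avg_gain.items()}
    # Return features ordered from most to least influential.
    return dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical raw gains (not study data).
example_gain = {"hypertension": 4.26, "infarct_size": 5.0, "fasting_glucose": 3.93}
ranked = scale_importance(example_gain)
for feature, score in ranked.items():
    print(f"{feature}: {score}")
```

Because the scores are relative, they indicate ranking within one model and are not comparable across algorithms.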
2.4 Model Training and Evaluation
The entire cohort was randomly split into a training set (70%, n = 245) for model development and an independent testing set (30%, n = 105) for final performance evaluation. All data preprocessing parameters were estimated from the training data and then applied to the testing data to prevent data leakage. These preprocessing steps included imputation of missing values (median for continuous variables and mode for categorical variables), standardization of continuous variables, one-hot encoding of categorical variables, winsorization of outliers at the 1st and 99th percentiles, and application of the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance.
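The leakage-avoidance principle described above can be sketched for a single continuous variable as follows. This is a minimal illustration in plain Python, not the study's pipeline: the function names, the order of steps (imputation, then winsorization, then standardization), and the example values are assumptions, and one-hot encoding and SMOTE are omitted.

```python
from statistics import mean, median, pstdev


def percentile(sorted_vals, pct):
    """Linear-interpolation percentile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * pct / 100.0
    low = int(pos)
    high = min(low + 1, len(sorted_vals) - 1)
    return sorted_vals[low] + (sorted_vals[high] - sorted_vals[low]) * (pos - low)


def _clean(value, params):
    """Impute a missing value with the training median, then winsorize."""
    value = params["median"] if value is None else value
    return min(max(value, params["lo"]), params["hi"])


def fit_preprocessor(train_values):
    """Estimate every preprocessing parameter from the training data only."""
    observed = sorted(v for v in train_values if v is not None)
    params = {
        "median": median(observed),
        "lo": percentile(observed, 1),
        "hi": percentile(observed, 99),
    }
    cleaned = [_clean(v, params) for v in train_values]
    params["mean"] = mean(cleaned)
    params["sd"] = pstdev(cleaned) or 1.0  # guard against a constant column
    return params


def transform(values, params):
    """Apply the training-derived parameters to any data split."""
    return [(_clean(v, params) - params["mean"]) / params["sd"] for v in values]


# Hypothetical fasting blood glucose values (None marks a missing value).
train = [5.1, 6.0, None, 7.2, 5.6, 30.0]
test = [None, 6.4]
params = fit_preprocessor(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # no statistic is re-estimated here
```

The testing values never influence the median, the percentile bounds, the mean, or the standard deviation, which is the property that prevents leakage.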
Model hyperparameters were optimized using 5-fold cross-validation within the training set, applying grid search for LR, RF, and SVM, and Bayesian optimization for XGBoost. The hyperparameter search ranges and the optimally selected values are detailed in Table 1. Model performance was evaluated on the independent testing set using the AUC, sensitivity, specificity, and accuracy. All analyses were performed in Python 3.9 (Python Software Foundation, Beaverton, OR, USA) using scikit-learn (v1.0.2) and XGBoost (v1.5.1) libraries.
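For reference, the four evaluation metrics can be computed from predicted probabilities as sketched below. The classification threshold of 0.5 is an assumption (the threshold used in the study is not stated), and the labels and scores are invented for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case outranks a
    random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed probability threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


# Hypothetical labels (1 = recurrence) and predicted probabilities.
y_demo = [1, 1, 1, 0, 0, 0, 0, 0]
p_demo = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
auc_demo = roc_auc(y_demo, p_demo)
metrics_demo = classification_metrics(y_demo, p_demo)
```

The AUC is threshold-free, whereas sensitivity, specificity, and accuracy all change with the chosen cut-off.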
2.5 Statistical Analysis
Statistical analyses were conducted using Python (v3.9; Python Software Foundation, Beaverton, OR, USA) with the scikit-learn (v1.0.2) and XGBoost (v1.5.1) libraries, and R (v4.1.2; R Foundation for Statistical Computing, Vienna, Austria) with the tidyverse (v1.3.1) and pROC (v1.18.0) packages. Categorical variables are presented as frequencies and percentages (n, %) and were compared using Pearson’s chi-square test. For continuous variables, normality was assessed using the Shapiro-Wilk test, and homogeneity of variances was assessed using Levene’s test. Normally distributed continuous variables are presented as mean ± standard deviation (mean ± SD) and were compared using Student’s t-test (two groups) or one-way analysis of variance (ANOVA; three or more groups). Non-normally distributed continuous variables are presented as median (interquartile range, IQR) and were compared using the Mann-Whitney U test (two groups) or Kruskal-Wallis test (three or more groups). All statistical tests were two-tailed, and a p-value < 0.05 was considered statistically significant.
Model calibration, representing agreement between predicted probabilities and observed outcomes, was assessed using the Hosmer-Lemeshow goodness-of-fit test. To further evaluate the key predictors identified by the best-performing model, multivariate logistic regression was performed with adjustment for potential confounders.
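The Hosmer-Lemeshow statistic groups patients by predicted risk and compares observed with expected event counts in each group. The sketch below is a plain-Python illustration under stated assumptions: equal-size risk groups, g − 2 degrees of freedom, predicted probabilities strictly between 0 and 1, and an even number of groups so that the chi-square tail probability has a closed form. The function name and the demonstration data are hypothetical.

```python
import math


def hosmer_lemeshow(y_true, probs, groups=10):
    """Return (statistic, p_value) for the Hosmer-Lemeshow test."""
    if groups < 4 or groups % 2:
        raise ValueError("this sketch requires an even number of groups >= 4")
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        size = len(idx)
        observed = sum(y_true[i] for i in idx)
        expected = sum(probs[i] for i in idx)
        # Contributions from events and from non-events in this risk group.
        stat += (observed - expected) ** 2 / expected
        stat += (observed - expected) ** 2 / (size - expected)
    # Chi-square survival function with df = groups - 2 (even df only):
    # sf(x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!, where k = df / 2.
    k = (groups - 2) // 2
    half = stat / 2.0
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return stat, p_value


# Hypothetical, perfectly calibrated data: 10 patients at each risk level,
# with the observed number of events equal to the expected number.
probs_demo = [p for p in (0.2, 0.4, 0.6, 0.8) for _ in range(10)]
y_demo_hl = [y for events in (2, 4, 6, 8) for y in [1] * events + [0] * (10 - events)]
hl_stat, hl_p = hosmer_lemeshow(y_demo_hl, probs_demo, groups=4)
```

A non-significant p-value indicates no detected departure from calibration; it does not by itself demonstrate good calibration, particularly in small samples.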
3. Results
3.1 Comparison of Baseline Characteristics Across Three Groups
The baseline characteristics of the three groups are summarized in Table 2. No significant differences were found across the three groups (Groups A, B, and C) regarding age, gender, comorbidities, or imaging features (all p > 0.05), indicating that the groups were well balanced at baseline.
3.2 Comparison of Characteristics Between the Training and Testing Sets
Comparison of baseline characteristics between the training set (70% of patients, n = 245) and the testing set (30%, n = 105) is detailed in Table 3. No significant differences were observed in any variable, including demographic factors, comorbidities, laboratory assessments, treatments, and imaging features (all p > 0.05), confirming balanced randomization. This balance supports the validity of subsequent model training and validation.
3.3 Stroke Recurrence Rate
Stroke recurrence rates across predefined subgroups, including admission-period groups, infarct location, and key clinical risk factors, are summarized in Table 4. Recurrence rates were comparable across the three time-period groups. In contrast, hypertension and carotid stenosis (≥50%) were linked to significantly higher recurrence rates, underscoring their role in recurrent stroke risk.
3.4 Predictive Performance of the ML Models
Predictive performance of the four ML models for stroke recurrence is shown in Table 5. Among them, the XGBoost model achieved the highest discrimination, with an AUC of 0.86 (95% confidence interval [CI]: 0.79–0.92), followed by RF (AUC 0.82, 95% CI: 0.75–0.89), SVM (AUC 0.78, 95% CI: 0.70–0.86), and LR (AUC 0.75, 95% CI: 0.67–0.83). Additionally, the XGBoost model showed the highest sensitivity (81.0%), specificity (84.1%), and overall accuracy (83.5%).
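The method used to obtain the 95% CIs is not stated here. One common option is a percentile bootstrap, sketched below as an assumption rather than as the study's procedure; the function names, the number of resamples, and the demonstration data are illustrative.

```python
import random


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative cases."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if 0 < sum(labels) < n:  # a resample must contain both classes
            stats.append(roc_auc(labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical testing-set labels (1 = recurrence) and predicted probabilities.
y_boot = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p_boot = [0.9, 0.8, 0.55, 0.3, 0.7, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1, 0.05]
ci_lower, ci_upper = bootstrap_auc_ci(y_boot, p_boot, n_boot=500)
```

With a testing set of about 100 patients, intervals of this kind are wide, which is why overlapping CIs between models should be interpreted cautiously.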
3.5 Calibration of Models
Calibration of all five predictive models, reflecting the agreement between predicted probabilities and observed outcomes, was assessed using the Hosmer-Lemeshow goodness-of-fit test. As shown in Supplementary Table 1, all models, including the traditional ESRS, demonstrated good calibration, with non-significant p-values (all p > 0.05). These findings indicate no significant discrepancy between predicted and observed stroke recurrence risk.
3.6 Subgroup Analysis by Infarct Location
To assess whether the predictive performance of the models generalized across etiologically distinct stroke subtypes, despite comparable overall recurrence rates, performance was evaluated separately in subgroups stratified by infarct location: cortical, subcortical, and posterior circulation. As shown in Table 6, XGBoost maintained the highest performance across all three subgroups, achieving an AUC of 0.88 (95% CI: 0.80–0.96) for cortical infarcts, 0.84 (95% CI: 0.76–0.92) for subcortical infarcts, and 0.81 (95% CI: 0.70–0.92) for posterior circulation infarcts. Random forest followed as the second-best performer in each subgroup, with AUCs of 0.83, 0.80, and 0.78, respectively.
3.7 Key Predictors of Stroke Recurrence
The ten most influential predictors of stroke recurrence identified by the XGBoost model based on feature importance ranking are listed in Table 7. Infarct size demonstrated the greatest contribution (100.0), followed by a history of hypertension (85.2) and fasting blood glucose (78.6), suggesting crucial roles in recurrence risk prediction.
3.8 Multivariate Logistic Regression for Key Predictors
The results of the multivariate logistic regression assessing associations between key predictors and stroke recurrence are shown in Table 8. Infarct size (odds ratio [OR] = 2.15, 95% CI: 1.52–3.04), hypertension (OR = 1.89, 95% CI: 1.12–3.18), and fasting blood glucose (OR = 1.67, 95% CI: 1.03–2.71) were independently associated with an increased risk of stroke recurrence (all p < 0.05).
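As a worked illustration of how an odds ratio and its interval relate to a logistic regression coefficient, the sketch below assumes Wald-type intervals, OR = exp(β) with limits exp(β ± 1.96·SE); the interval method used in the study is not stated, and the function names are hypothetical. Under this assumption, the reported infarct size estimate can be converted back to the coefficient scale and reproduced.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald 95% CI from a coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coef_from_or(odds_ratio, lower, upper, z=1.96):
    """Recover the coefficient and standard error from a reported OR and CI."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Reported estimate for infarct size: OR = 2.15 (95% CI: 1.52-3.04).
beta, se = coef_from_or(2.15, 1.52, 3.04)
or_est, or_lower, or_upper = odds_ratio_ci(beta, se)
print(f"beta = {beta:.3f}, SE = {se:.3f}")
print(f"OR = {or_est:.2f} (95% CI: {or_lower:.2f}-{or_upper:.2f})")
```

The interval is symmetric on the log-odds scale, so it appears asymmetric around the OR itself.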
4. Discussion
The present study systematically compared the performance of four machine learning algorithms for predicting stroke recurrence using an integrated set of routine clinical variables and imaging features. Among them, the XGBoost model demonstrated the strongest predictive performance, achieving an AUC of 0.86. The findings underscore the potential of ML-based approaches to enhance risk stratification for stroke recurrence and to address key limitations of traditional prediction models that rely on a narrow set of clinical variables.
The superior performance of XGBoost compared with logistic regression, random forest, and SVM aligns with previous findings highlighting that gradient-boosting frameworks are well-suited to complex, high-dimensional clinical datasets [21]. A possible explanation for its superior performance is XGBoost’s capability to model non-linear relationships and higher-order interactions among variables, such as the synergistic effect of infarct size and hypertension. For instance, while large infarcts are associated with higher recurrence risk, this effect may be significantly amplified in patients with poorly controlled hypertension, a relationship that linear models such as logistic regression may not capture adequately. This capability is particularly relevant in stroke research, where recurrence risk is determined by a complex interaction of vascular, metabolic, and neuroimaging-related factors.
Integrating imaging-derived features into the predictive models represents a key strength of this study. Traditional models often overlook neuroimaging data because of its analytical complexity and the need for specialized interpretation; however, our results indicate that imaging features, particularly infarct size, contribute significantly to recurrence prediction. Infarct size ranked as the most influential predictor in the XGBoost model, consistent with previous evidence linking larger infarcts to higher recurrence risk [22]. Larger infarcts usually reflect more severe arterial occlusion, greater ischemic injury, and a higher likelihood of underlying vasculopathy, which together increase the risk of subsequent cerebrovascular events [23]. Additionally, incorporating markers such as leukoaraiosis and carotid artery stenosis captures the contributions of small-vessel disease and large-artery atherosclerosis, respectively, thereby enhancing the clinical relevance of risk stratification [24].
The identification of hypertension and fasting blood glucose as key predictors reinforces the crucial role of metabolic and vascular risk management in secondary prevention. Hypertension, a well-established driver of stroke pathogenesis, promotes arteriosclerosis, disrupts endothelial function, and increases susceptibility to small vessel occlusion [25]. Similarly, elevated fasting blood glucose levels, even among individuals without a diagnosis of diabetes, may indicate insulin resistance and systemic inflammation, both of which contribute to vascular injury and thrombus formation [26]. Notably, lifestyle-based interventions can significantly improve these metabolic parameters [27]. These findings support current clinical guidelines that emphasize tight blood pressure and glycemic management after stroke, while also highlighting how ML-based models may help identify high-risk individuals who could benefit from more aggressive intervention.
Subgroup analyses revealed that the XGBoost model maintained strong predictive performance across patients with cortical, subcortical, and posterior circulation infarcts, suggesting good generalizability across stroke subtypes with differing etiologies (e.g., large-artery atherosclerosis for cortical, small-vessel disease for subcortical, and vertebrobasilar pathology for posterior circulation infarcts). This result is clinically relevant because these etiologically distinct subtypes may require tailored preventive strategies [15]. The consistent performance of the model across these subgroups supports its potential as a flexible and broadly applicable approach to clinical risk stratification.
Our results also highlight the limitations of traditional risk scores. For example, the ESRS, which relies on variables such as age, hypertension, and diabetes, typically achieves an AUC of about 0.65–0.70 for predicting recurrence [28]. In contrast, the XGBoost model yielded an AUC of 0.86, representing a meaningful improvement in predictive accuracy that could improve identification of high-risk patients. However, ML-based models should be used to complement, not replace, clinical decision-making. While the XGBoost model provides a quantitative risk estimation, clinicians should interpret these findings alongside patient-specific factors, including adherence to medication and lifestyle factors, to guide tailored management.
Several limitations of the study should be considered when interpreting these results. First, the single-center, retrospective design may limit the generalizability of the findings. Variations in clinical practice patterns, imaging acquisition and interpretation, and follow-up procedures across institutions could affect model performance, emphasizing the need for external validation in multicenter cohorts. Second, the study focused on recurrence within the first year after stroke, and longer follow-up is needed to assess how well these models predict late recurrent events. Third, several potentially informative predictors, including genetic markers, lifestyle factors (e.g., smoking status and physical activity), and detailed data on medication adherence, were not included because they were unavailable in electronic medical records. Incorporating these variables in future studies may further improve predictive accuracy. Fourth, while the XGBoost model demonstrated strong performance, the restricted interpretability typical of “black box” models may hinder clinical acceptance without robust explanation frameworks and prospective assessment. Fifth, and importantly, antithrombotic medications were inadequately characterized. Antiplatelet therapy was captured only as a binary variable that did not distinguish between single and dual regimens. Crucially, anticoagulant use, which is a critical determinant of recurrence prevention in patients with atrial fibrillation, was not consistently available. The absence of this key confounder likely affected the model’s performance and should be addressed in future studies.
Despite these limitations, this study advances our understanding of ML-based stroke recurrence prediction by demonstrating the benefit of integrating routine clinical variables with imaging-derived data. The XGBoost model demonstrated high discriminative performance and consistent outcomes across subgroups, indicating potential application for supporting personalized secondary prevention strategies. However, the single-center, retrospective design and the lack of external validation in diverse, multicenter cohorts remain significant limitations and may restrict generalizability. Furthermore, restricting outcomes to a 1-year recurrence window does not capture late recurrent events, and longer follow-up would strengthen the clinical relevance of the model. Future studies should prioritize external validation to ensure robustness across different patient populations, imaging protocols, and clinical workflows; incorporate additional predictive variables (such as lifestyle, adherence, and other biologically informative predictors); and develop practical, user-friendly tools to facilitate implementation in routine clinical care.
In summary, machine learning algorithms that integrate routine clinical variables with imaging-derived features can effectively predict stroke recurrence risk, with the XGBoost model offering the highest overall performance. Infarct size, hypertension, and fasting blood glucose were identified as the most influential predictors, underscoring the importance of structural neuroimaging and rigorous management of metabolic and vascular risk factors in secondary prevention. These findings support the use of ML-based models as adjuncts to clinical decision-making, with the potential to improve outcomes by facilitating more targeted risk reduction approaches.
5. Conclusion
This study demonstrates that machine learning algorithms integrating routine clinical data and imaging features can predict stroke recurrence risk effectively, with the XGBoost model achieving the highest overall performance. The key predictors, particularly infarct size and a history of hypertension, underscore the significance of structural brain injury and vascular-metabolic dysregulation in driving recurrence risk. Robust performance across cortical, subcortical, and posterior circulation infarct subgroups further supports the model’s potential clinical utility in diverse stroke subtypes with distinct pathophysiological mechanisms.
Key Points
• Machine learning models, particularly XGBoost, that integrate both routine clinical and imaging-derived features demonstrate a higher predictive performance for stroke recurrence risk than traditional models.
• Infarct size, a history of hypertension, and fasting blood glucose levels were identified as the most influential predictors of recurrence.
• The XGBoost model maintained robust predictive performance across different stroke subtypes defined by infarct location.
• This study highlights the potential of applying advanced analytical methods and multimodal data for enhancing risk stratification and supporting personalized secondary prevention strategies in stroke survivors.
Availability of Data and Materials
The datasets analyzed during the current study are available from the corresponding author on reasonable request.