Evaluating machine learning model for investigating surface chloride concentration of concrete exposed to tidal environment

Thi Tuyet Trinh NGUYEN, Long Khanh NGUYEN

Front. Struct. Civ. Eng. ›› 2025, Vol. 19 ›› Issue (2) : 262-283.

DOI: 10.1007/s11709-025-1135-1
RESEARCH ARTICLE


Abstract

The surface chloride concentration of concrete is a critical factor in determining the service life of concrete in tidal environments. This study aims to identify an effective Machine Learning (ML) model for predicting and assessing surface chloride concentration in such conditions. Using a database that includes 12 input variables and 386 samples of surface chloride concentration in seawater-exposed concrete, the study evaluates the predictive performance of nine ML models. Among these models, the Gradient Boosting (GB) model, using default hyperparameters, demonstrates the best performance, achieving a coefficient of determination (R2) of 0.920 and a root mean square error of 0.103% by weight of concrete for the testing data set. Furthermore, an Excel file based on the GB model is created to estimate surface chloride concentration, simplifying the mix design process according to concrete durability requirements. Shapley additive explanation values and one-dimensional partial dependence plots offer a detailed analysis of the impact of the 12 variables on surface chloride concentration. The four most influential factors are, in descending order, fine aggregate content, exposure time, annual mean temperature, and coarse aggregate content. Specifically, surface chloride concentration increases with prolonged exposure time, stabilizing after a certain period, while higher fine aggregate content leads to a reduction in surface chloride concentration.

Graphical abstract

Keywords

machine learning / surface chloride concentration / seawater / factor effect / service life / tidal environment

Cite this article

Thi Tuyet Trinh NGUYEN, Long Khanh NGUYEN. Evaluating machine learning model for investigating surface chloride concentration of concrete exposed to tidal environment. Front. Struct. Civ. Eng., 2025, 19(2): 262‒283 https://doi.org/10.1007/s11709-025-1135-1

1 Introduction

Reinforced Concrete (RC) is a commonly employed material in infrastructure due to its cost-effectiveness and durability. However, chloride-induced deterioration leads to significant expenses for maintaining RC structures, particularly those located in coastal, marine, and offshore areas. If chloride ions accumulate in the concrete matrix surrounding the reinforcing steel, they may compromise the passive film, initiating and accelerating steel corrosion. This corrosion can result in concrete cracking, spalling, and a loss of the load-carrying capacity of RC structures. Importantly, the degradation becomes increasingly severe with prolonged exposure time [1], and the corrosion of steel not only disrupts the regular functioning of engineering structures but also presents risks of premature failure, engineering accidents, safety hazards, and substantial economic losses [2]. Hence, durability design and service life prediction have emerged as crucial elements in contemporary concrete structure design.
The most popular conception of the service life model was initiated by Tuutti [3], in which the service life comprises a time of initiation ti and a time of propagation tp. Once chloride ions penetrate the concrete, the reinforcement becomes susceptible to corrosion, which ultimately results in deterioration of the steel [4–6]. Corrosion begins when the chloride concentration around the reinforcing steel reaches a critical level, and the duration needed for corrosion to commence is commonly denoted the corrosion initiation period ti. The value of the critical chloride concentration differs between standards [7]; however, ti can be calculated from a reactive transport model [8]. Therefore, the design life estimate depends strongly on the time of propagation tp. Ranjith et al. [9] utilized various models, including Bazant’s model and Wang and Zhao’s model, to evaluate tp in their investigation, and they affirm that the surface chloride content of concrete is the most crucial input variable for predicting tp. Hence, accurately assessing the surface chloride concentration of concrete is a crucial aspect in estimating its durability and projecting the lifespan of concrete structures.
Numerous prior publications have presented a wide range of field values for the surface chloride concentration, offering empirical models for the surface chloride concentration of concrete [2,10,11]. However, these models have demonstrated limited efficacy in predicting or characterizing surface chloride concentration. This limitation arises because surface chloride concentration is a complex parameter influenced by various factors, including environmental elements (chloride concentration of seawater, zonation, temperature, etc.), material properties (binder content, binder composition, water/binder ratio, etc.), and exposure time. For instance, Shakouri and Trejo [12] conducted experimental investigations to explore the impact of three factors, namely exposure time, water/cement (w/c) ratio, and chloride concentration of seawater, on surface chloride concentration. The findings revealed that exposure time plays a crucial role: surface chloride concentration increases significantly as the duration of exposure increases. Moreover, surface chloride concentration demonstrated a nonlinear escalation with higher chloride concentration in seawater. However, it was observed that the w/c ratio did not exert a substantial influence on surface chloride concentration. In contrast, Gao et al. [13] noted that a rise in the w/c ratio is associated with an increase in surface chloride concentration.
In addition, Cai et al. [14] have compiled eight empirical models from the existing literature that can be used to estimate the surface chloride concentration of concrete. These models include time-variant models that utilize logarithmic, power, exponential, and other mathematical functions. However, these models struggle to reasonably describe the developmental pattern of surface chloride concentration, whereby values increase rapidly in the early stages and tend to stabilize later on. Additionally, these models often overlook other crucial factors, such as material composition and environmental action classes. Li et al. [15] and Chalee et al. [11] have developed models predicting surface chloride concentration in relation to the water/binder ratio and exposure time but disregard the impact of binder type and environmental factors. Similarly, Marques et al. [16] have presented surface chloride concentration models incorporating variables related to materials and environmental conditions but overlook the effect of exposure time. It is evident that these conventional quantitative surface chloride concentration models can only account for some, not all, of the influential factors. This limitation is attributed to the lack of an extensive data set and a robust methodology capable of considering such a multitude of variables.
In recent decades, the rapid expansion of big data has led to the widespread utilization of Machine Learning (ML) models across various domains [17,18], including civil engineering tasks such as the prediction and evaluation of pavement rutting depth [17] and of the compressive strength of recycled aggregate concrete [19]. In related investigations, the coefficient of chloride diffusion has been predicted accurately and with high performance using the proposed ML models. Consequently, the challenge of establishing nonlinear connections between multiple variables in civil engineering can be addressed through the application of the ML approach [17].
ML models have numerous benefits compared to conventional numerical models in analysis, resulting in enhanced performance. Conventional numerical models frequently depend on manually designed characteristics, which can be laborious and necessitate specialized knowledge in a particular field. On the other hand, ML models have the ability to autonomously acquire significant characteristics from the data, hence minimizing the requirement for manual feature engineering and potentially revealing more complex patterns within the data [20]. Many real-world phenomena exhibit complex, nonlinear relationships that traditional numerical models struggle to capture. ML models, particularly deep learning models, are well-suited for capturing nonlinearities and can learn complex mappings between input and output variables, leading to more accurate predictions [21]. Traditional numerical models may struggle with high-dimensional data or data sets with a large number of features. ML models, however, are capable of handling high-dimensional data efficiently, making them suitable for tasks such as image classification, natural language processing, and genomic analysis [22]. ML models are designed to generalize well to unseen data, meaning they can make accurate predictions on new, previously unseen examples [23]. By learning underlying patterns from the training data, ML models can make predictions that are robust and applicable across different data sets and scenarios.
While some ML models, such as deep neural networks, are often criticized for their lack of interpretability, other ML techniques, such as Decision Trees (DT) and linear models, offer more transparent and interpretable representations of the learned relationships in the data. This interpretability can provide valuable insights into the underlying mechanisms driving the model’s predictions [24].
Moreover, in the ML community, the integration of physics principles into ML models has led to the development of Physics-Informed Machine Learning (PIML) models [25]. These models have garnered significant attention and have been widely applied across various applications due to their ability to leverage both data-driven techniques from ML and fundamental laws of physics. For instance, the PIML merging ML techniques with phenomenological laws is employed to consider multiple influencing factors and enhance the accuracy of fatigue life prediction [26]. Recent studies have demonstrated the successful application of ML techniques to solve partial differential equations in computational mechanics. Samaniego et al. [27] proposed an energy-based approach using ML, while Zhuang et al. [28] utilized a deep autoencoder-based energy method for the analysis of Kirchhoff plates. Furthermore, Guo and Yin [29] developed a novel physics-informed deep learning strategy with a local time-updating discrete scheme to tackle multi-dimensional forward and inverse consolidation problems.
More broadly, the significance of ML models lies in their ability to automate feature extraction, capture complex nonlinear relationships, handle high-dimensional data, scale effectively, generalize well, adapt to changing environments, and provide interpretable results. These advantages have led to the widespread adoption of ML techniques across diverse domains, including healthcare, finance, marketing, robotics, and many others.
Indeed, there have been studies aiming to forecast the surface chloride concentration of concrete using ML models. For instance, Cai et al. [14] developed an ensemble ML model using 642 samples and 12 input variables to predict the surface chloride concentration of concrete exposed to three seawater zones (tidal, splash, and submerged); this ensemble model achieved R2 = 0.83 and RMSE = 0.16% by weight of concrete.
That ML model was developed to predict surface chloride concentration across all three zones, utilizing a data set consisting of 386 samples from the tidal zone, 122 samples from the splash zone, and 134 samples from the submerged zone. The uneven distribution of samples in this database is noticeable and adversely impacts the model’s performance in predicting surface chloride concentration. Moreover, the influence of the input variables on surface chloride concentration has yet to be explored in a way that enhances the predictive capabilities of ML models; such an understanding could greatly aid engineers in evaluating the primary factors affecting the surface chloride concentration of concrete. Hence, this investigation also focuses on quantifying the effects of factors on the surface chloride concentration of concrete exposed to tidal environments using advanced ML interpretation techniques, namely SHapley Additive exPlanations (SHAP) [30] and the Partial Dependence Plot (PDP) [31].
The main aim of this work is to identify the most efficient ML model for predicting surface chloride content and evaluating the impact of different parameters. The criteria for selecting the best ML models are determined by two key factors: 1) the complexity of the training process, which directly impacts the computational time and resources required by civil engineers, and 2) the accuracy and reliability of the ML model. The training process complexity depends on whether standalone ML algorithms or hybrid ML algorithms are used. Optimizing hyperparameters during the training phase, which is essential for improving model performance, can be done either manually or using meta-heuristic algorithms (the latter approach is also known as hybrid ML) [19]. Implementing hybrid algorithms can be intricate and time-consuming. On the other hand, standalone ML algorithms, utilizing the default hyperparameters available in Python libraries, result in simpler models that are more user-friendly for civil engineers. Therefore, default ML algorithms are selected for the training process. Each ML model is associated with specific default hyperparameters, which significantly influence the accuracy and reliability of surface chloride concentration predictions. Investigating various individual ML algorithms enhances the potential for selecting a high-performance model. This study proposes an evaluation of nine ML algorithms: Support Vector Regression (SVR), Adaptive Boosting (AdaBoost), Random Forest (RF), Extreme Gradient Boosting (XGB), Gradient Boosting (GB), Light Gradient Boosting (LightGB), DT, K-Nearest Neighbor (KNN), and Multivariate Adaptive Regression Splines (MARS). The ML models used in this study are popular supervised learning algorithms available in the Python Sklearn libraries. Notably, XGB has demonstrated success in prominent online ML competitions on the Kaggle platform, particularly with structured data sets.
Other well-known ensemble methods such as RF, GB and AdaBoost are also employed. Additionally, LightGB stands out as an effective framework developed by Microsoft.
The ML models are developed using a data set comprising 386 data samples of concrete exposed to tidal environments and 12 input variables summarized from Cai et al. [14]. The best ML model is evaluated using two performance metrics, the coefficient of determination (R2) and Root Mean Square Error (RMSE), through 10 repeats of 10-Fold Cross-Validation (CV). The best ML model is then used to predict surface chloride concentrations. In the final section, the effect of each input variable on surface chloride concentration is explored using SHAP and one-dimensional PDP (PDP 1D).
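As an illustration of the model pool, the scikit-learn members of the nine algorithms can be instantiated with default hyperparameters as below. This is a hedged sketch, not the authors' actual pipeline: synthetic data stands in for the 386-sample chloride database, and XGB, LightGB, and MARS are omitted because they live in the separate xgboost, lightgbm, and py-earth packages.

```python
# Hedged sketch: scikit-learn members of the nine-model pool, all with
# default hyperparameters. Synthetic data stands in for the real database.
from sklearn.datasets import make_regression
from sklearn.svm import SVR
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

models = {
    "SVR": SVR(),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "RF": RandomForestRegressor(random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
    "DT": DecisionTreeRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}

# 386 samples x 12 features, mirroring the shape of the study's database.
X, y = make_regression(n_samples=386, n_features=12, noise=5.0, random_state=0)
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
for name, r2 in scores.items():
    print(f"{name}: training R2 = {r2:.3f}")
```

Training-set R2 values like these are optimistic by construction; the study's actual ranking relies on cross-validation and a held-out test set.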

2 Methodology flowchart of Machine Learning approach

In this research, nine ML algorithms were employed, utilizing default hyperparameters from the Python library. The algorithms include SVR, AdaBoost, RF, XGB, GB, LightGB, DT, KNN, and MARS. The nine ML algorithms are briefly described in Section 4.
The ML investigation methodology for evaluating the surface chloride concentration of concrete is depicted in Fig.1. The investigation process unfolds in three main steps.
Fig.1 Methodology flowchart of this investigation.


Step I: Data description (Section 3). This step encompasses statistical analysis, the correlation matrix, and histogram data for each variable in the database.
Step II: Training the ML models and evaluating their performance (Subsection 5.1). The data set undergoes a 70/30 split [32], with 70% used for training and evaluating the ML models’ performance. The training involves 12 input variables, aiming to predict the surface chloride concentration of concrete. Performance evaluation is conducted using R2. The reliability and performance of the ML model results are validated through 10 repeats of 10-fold Cross-Validation. This thorough evaluation of nine ML models aids in selecting the top four for subsequent investigations.
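The split-and-validate procedure of Step II can be sketched with scikit-learn's model-selection utilities. This is a minimal sketch under stated assumptions: synthetic data stands in for the 386-sample database, and only one of the nine models is shown.

```python
# Sketch of Step II: 70/30 split plus 10 repeats of 10-fold CV.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import (RepeatedKFold, cross_val_score,
                                     train_test_split)

# Synthetic stand-in for the 386-sample, 12-variable database.
X, y = make_regression(n_samples=386, n_features=12, noise=10.0, random_state=0)

# 70% for training and model selection, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# 10 repeats of 10-fold cross-validation -> 100 R2 scores per model.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                         X_train, y_train, scoring="r2", cv=cv)
print(f"mean R2 over {len(scores)} folds: {scores.mean():.3f}")
```

Repeating the 10-fold procedure ten times averages out the randomness of any single fold assignment, which is why the study reports 100 fold scores per model.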
Step III: Prediction comparison and SHAP-PDP 1D. Using the four best ML models identified in Step II, prediction of the surface chloride concentration of concrete is conducted on the testing data set, and the accuracy of the predictions is quantified through R2 and RMSE. These two performance metrics are used to compare the accuracy of the four best ML models in predicting the surface chloride concentration of concrete. Based on the comparison results, the two best ML models are selected to investigate the quantitative and qualitative effects of the 12 input variables on the surface chloride concentration of concrete (Subsection 5.2). This quantitative and qualitative analysis is conducted using the SHAP technique. Finally, the best ML model is used in combination with the PDP 1D technique to evaluate the specific quantitative effect of each input variable on the surface chloride concentration of concrete. This quantity is compared with a simple linear correlation describing the relationship between the 12 input variables and the surface chloride concentration of concrete (Subsection 5.3).
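The PDP 1D idea of Step III can be computed by hand: fix one feature at a value, average the model's predictions over the data, and sweep that value across the feature's range. The sketch below does exactly that on synthetic data; the study's SHAP analysis would additionally require the `shap` package and is not reproduced here.

```python
# Minimal one-dimensional partial dependence computed by hand.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

def partial_dependence_1d(model, X, feature, grid_points=20):
    """Average model prediction as one feature sweeps its observed range."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_points)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # hold the feature at one value
        averages.append(model.predict(X_mod).mean())
    return grid, np.array(averages)

# Synthetic stand-in for the study's database and GB model.
X, y = make_regression(n_samples=386, n_features=12, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
grid, avg = partial_dependence_1d(model, X, feature=9)
print(grid.shape, avg.shape)
```

Plotting `avg` against `grid` gives the PDP 1D curve for that feature, e.g., surface chloride concentration versus exposure time in the study's setting.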

3 Data description

This study leverages a comprehensive data set comprising 386 samples, encompassing 12 input variables and one output variable. The data, sourced from Cai et al. [14], represent experimental insights into factors influencing the surface chloride concentration of concrete. The input variables include crucial parameters such as cement content, Fly Ash (FA) content, Ground Granulated Blast Furnace Slag (GGBFS) content, Silica Fume (SF) content, superplasticizer content, water content, fine aggregate content, coarse aggregate content, water/binder ratio, exposure time, annual mean temperature of the environment, and chloride concentration of seawater. The sole output variable is the surface chloride concentration of concrete.
Tab.1 provides a comprehensive overview of statistical measures, including mean, median, minimum, maximum, quartiles (Q25%, Q75%), and StD for all variables. Simultaneously, Fig.2 visually presents histograms depicting the distribution of variable values. The statistical values offer insights into the central tendencies and variability of each variable.
Tab.1 Statistical values of the database: Mean, Standard Deviation (StD), Minimum (Min), Quartile 25% (Q25%), Median, Quartile 75% (Q75%), and Maximum (Max)

| Variable | Unit | Mean | StD | Min | Q25% | Median | Q75% | Max |
| Cement | kg/m3 | 364.64 | 71.45 | 157.50 | 340.00 | 360.00 | 406.00 | 480.00 |
| FA | kg/m3 | 46.43 | 70.36 | 0.00 | 0.00 | 0.00 | 72.00 | 239.00 |
| GGBFS | kg/m3 | 12.90 | 45.62 | 0.00 | 0.00 | 0.00 | 0.00 | 292.50 |
| SF | kg/m3 | 5.37 | 13.04 | 0.00 | 0.00 | 0.00 | 0.00 | 50.00 |
| Superplasticizer | kg/m3 | 1.28 | 1.70 | 0.00 | 0.00 | 1.00 | 1.00 | 7.00 |
| Water | kg/m3 | 200.05 | 46.25 | 140.00 | 180.00 | 180.00 | 215.00 | 311.00 |
| Fine aggregate | kg/m3 | 749.62 | 122.51 | 552.00 | 639.00 | 800.00 | 800.00 | 1232.00 |
| Coarse aggregate | kg/m3 | 984.76 | 130.58 | 410.00 | 957.50 | 1000.00 | 1020.00 | 1269.00 |
| Water/binder | – | 0.47 | 0.08 | 0.34 | 0.40 | 0.45 | 0.50 | 0.65 |
| Exposure time | year | 4.50 | 6.91 | 0.08 | 1.35 | 3.00 | 5.00 | 48.65 |
| Annual mean temperature | °C | 19.14 | 9.50 | 7.00 | 10.00 | 16.50 | 30.00 | 30.00 |
| Chloride concentration | g/L | 19.08 | 2.90 | 13.00 | 17.00 | 19.00 | 19.80 | 27.37 |
| Surface chloride concentration | % (weight of concrete) | 0.834 | 0.37 | 0.22 | 0.50 | 0.79 | 1.13 | 1.95 |
Fig.2 Histogram data of each variable: (a) cement content; (b) FA content; (c) GGBFS content; (d) SF content; (e) superplasticizer content; (f) water content; (g) fine aggregate content; (h) coarse aggregate content; (i) water/binder; (j) exposure time; (k) annual mean temperature; (l) chloride concentration of seawater; (m) surface chloride content.


For example, cement content ranges from 157.50 to 480.00 kg/m3, with an average of 364.64 kg/m3, predominantly concentrated between 300 and 480 kg/m3. Similarly, FA content spans from 0.00 to 239.00 kg/m3, averaging 46.43 kg/m3, with a significant number of samples distributed between 50 and 239 kg/m3. GGBFS content ranges from 0.00 to 292.50 kg/m3, with limited utilization in the models.
Superplasticizer content varies from 0.00 to 7.00 kg/m3, with a predominant concentration around 1.00 kg/m3. Water content exhibits an average of 200.05 kg/m3, ranging from 140.00 to 311.00 kg/m3, widely distributed around 180 kg/m3. Fine aggregate content ranges from 552.00 to 1232.00 kg/m3, averaging 749.62 kg/m3, with a substantial presence around 800 kg/m3.
Coarse aggregate content varies from 410.00 to 1269.00 kg/m3, averaging 984.76 kg/m3, displaying widespread distribution between 850 and 1269.00 kg/m3. Water/binder ratio ranges from 0.34 to 0.65, averaging 0.47, with a concentration around 0.45.
Exposure time spans from 0.08 to 48.65 years, averaging 4.5 years, primarily concentrated between 0.08 years and 10 years. The annual mean temperature averages 19.14 °C, with samples predominantly distributed between 16 °C and approximately 20 °C.
Chloride concentration in seawater varies from 13.00 to 27.37 g/L, averaging 19.08 g/L, with a notable concentration around 19.00 g/L. The surface chloride concentration ranges from 0.22% weight of concrete to 1.95% weight of concrete, showing an approximately normal distribution. Key percentiles include Q25% at 0.50% weight of concrete, the median at 0.79% weight of concrete, and Q75% at 1.13% weight of concrete.
The relationships between the input and output variables are depicted through a matrix of Pearson correlations, whose entries convey the direction and strength of each pairwise relationship via the Pearson correlation coefficient (R). The formula for R is defined as follows:
R = \frac{\sum_{i=1}^{N}\left(inp_i-\overline{inp}\right)\left(out_i-\overline{out}\right)}{\sqrt{\sum_{i=1}^{N}\left(inp_i-\overline{inp}\right)^{2}}\sqrt{\sum_{i=1}^{N}\left(out_i-\overline{out}\right)^{2}}},
where N is the sample size; inp_i and out_i are the values of sample i for the two variables; and \overline{inp}, \overline{out} are the mean values of the two variables over all samples.
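The coefficient R can be computed directly from this definition; a small numpy sketch with toy values (not the study's data):

```python
import numpy as np

def pearson_r(inp, out):
    """Pearson correlation coefficient R between two variables."""
    inp, out = np.asarray(inp, float), np.asarray(out, float)
    d_inp, d_out = inp - inp.mean(), out - out.mean()
    return (d_inp * d_out).sum() / np.sqrt((d_inp ** 2).sum() * (d_out ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_r(x, 2 * x + 1))   # perfectly linear -> 1.0
print(pearson_r(x, x ** 2))      # nonlinear but monotonic -> close to 1, not 1
```

The second call illustrates the limitation discussed below: R measures only linear association, so a strong nonlinear relationship can still yield |R| < 1.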
The detailed analysis of Pearson correlation, encompassing both inputs and the output, is presented in Fig.3. The matrix of Pearson correlation (Fig.3) highlights that most input variables exhibit weak correlations with the output variable “surface chloride concentration,” with absolute values ranging from 0.00 to 0.49. Notably, the strongest correlation is observed between water content and surface chloride concentration, with a correlation coefficient of 0.49. This suggests that an increase in water content in the mix design corresponds to an elevation in the surface chloride concentration of concrete.
Fig.3 Matrix correlation value for all variables.


Examining the correlations among the 12 input variables, the most robust correlation is identified between water content and water/binder ratio, yielding a coefficient of 0.88. Despite assessing feature importance through the Pearson correlation coefficient, none of the coefficients are deemed strong enough to warrant the reduction of recommended inputs. Consequently, all 12 input variables are considered beneficial for feature importance analysis.
The Pearson correlation coefficient, capable of measuring only linear relationships between two variables, ranges from −1.0 to 1.0. The coefficients between the input variables and the output variable are presented in the last row of the matrix (Fig.3). However, it is crucial to note that an R value close to zero does not signify independence between two variables. This limitation, inherent in the Pearson correlation description, is underscored by the nonlinear relationships evident in the scatter distribution plot of Fig.1.
This section aims to evaluate the strength of relationships and data distribution to mitigate variable noise, ultimately enhancing the performance for ML models. Moreover, these correlations will undergo thorough assessment using different ML techniques such as SHAP and PDP 1D in the last section to quantify the correlations.

4 Machine Learning approach

ML, a branch of soft computing, can automatically extract information from data without explicit, intricate programming. ML algorithms are based on the ability to obtain data and apply it for self-learning. The fundamental objective of ML is to enable computers to swiftly acquire process knowledge on their own, in the absence of human expertise, so that they can adapt their actions far more effectively. An ML algorithm produces an inference function with accurate output value predictions, and several models can be utilized to carry out the necessary conversion from input to output; ML methods differ in how they carry out the learning process and express the acquired knowledge. In this study, the surface chloride concentration of concrete exposed to the tidal zone of seawater is estimated using nine ML algorithms: SVR, AdaBoost, RF, XGB, GB, LightGB, DT, KNN, and MARS.

4.1 Machine Learning algorithms

4.1.1 Support Vector Regression

SVR, a prominent supervised-learning algorithm, is widely utilized for classification and regression tasks. Originating from the work of Vapnik and colleagues in 1963, SVR gained prominence in the 1990s, particularly for addressing nonlinear problems through the kernel method [33]. SVR operates by identifying a hyperplane that effectively segregates data points, creating distinct regions for each data type. The crux of SVR lies in selecting the optimal hyperplane, one that maximizes the margin between different regions. This margin is a crucial aspect, as ML theory affirms that maximizing it minimizes the error bound, contributing to the robustness of SVR in handling diverse classification and regression challenges.
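A small SVR regression sketch with an RBF kernel, on toy data rather than the study's database; feature scaling is included because the kernel is distance-based.

```python
# SVR sketch: RBF kernel on a 1D nonlinear toy problem.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(200, 1))
y = np.sin(X).ravel()

# C controls the margin/error trade-off; epsilon sets the width of the
# tube inside which deviations are not penalized.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X, y)
print(round(model.score(X, y), 3))
```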

4.1.2 Adaptive Boosting Regressor

AdaBoost stands out as an ensemble learning method that utilizes a boosting strategy, training weak learners in the form of DTs. Short for adaptive boosting, AdaBoost pioneered an adaptive approach to handling weak learners and has gained prominence in the field [34]. Its distinctive feature lies in aggregating a multitude of weak learners and iteratively training them on copies of the original data set, with a focus on the most challenging data points or outliers [35]. AdaBoost involves creating N copies of weak learners and training them on the same feature set but with varying weights, resembling a metadata model.
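A minimal AdaBoost regression sketch on toy data, assuming scikit-learn's default configuration (where the weak learner is a depth-3 decision tree); each boosting round reweights the training samples the current ensemble handles worst.

```python
# AdaBoost sketch: boosted shallow trees on noisy 1D data.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.05, 300)

# 100 weak learners, combined with sample reweighting between rounds.
model = AdaBoostRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(round(model.score(X, y), 3))
```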

4.1.3 Random Forest

RF, initially introduced by Breiman [36], is a ML algorithm that leverages multiple classification or regression trees grouped together. Rooted in the DT model, RF employs an ensemble of trees, and each tree contributes as a vote in the algorithm’s decision-making process. The collective learning approach, amalgamating individual outcomes from each tree, often leads to superior results. RF extends the principles of bagging (Bootstrap Aggregation), employing random training data samples repetitively to construct numerous regression trees without pruning. The final outcome is the aggregation of the averages of these trees, contributing to the robustness and predictive power of the algorithm. RF’s strength lies in its ability to mitigate overfitting, enhance model generalization, and accommodate complex data sets by leveraging the diversity of multiple trees. This ensemble learning technique has become a cornerstone in various ML applications, offering versatility and high performance in classification and regression tasks.
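The bagging mechanism described above has a useful by-product: samples left out of each bootstrap draw give an out-of-bag (OOB) generalization estimate without a separate validation set. A toy sketch (not the study's data):

```python
# Random Forest sketch: bootstrap-sampled trees whose predictions are
# averaged; oob_score_ estimates generalization from the held-out samples.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)
print(round(model.oob_score_, 3))
```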

4.1.4 Gradient Boosting

GB stands as a composite algorithm that strategically employs boosting techniques to formulate a sophisticated predictive tool inspired by the principles of gradient descent [37]. At its core, boosting commences with the construction of a tree to unveil the intricate relationships between input and output variables. Subsequent trees are iteratively developed to rectify errors and refine predictions. GB conceptualizes boosting as an optimization problem, framing it within the context of minimizing a loss function to mitigate errors and enhance model accuracy. This approach sets GB apart as a powerful ensemble learning method that iteratively improves predictive performance. Each tree in the ensemble focuses on addressing the shortcomings of its predecessor, continuously refining the model’s predictive capabilities. By viewing boosting as an optimization challenge, GB harnesses the principles of gradient descent to navigate the complex landscape of model optimization.
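The error-correcting behavior described above can be observed directly: scikit-learn's `staged_predict` exposes the ensemble's prediction after each boosting stage, so the training error can be tracked as trees accumulate. A toy sketch using default hyperparameters (100 trees, learning rate 0.1), which is the configuration the study adopts for GB:

```python
# Gradient Boosting sketch: each new tree fits the residuals of the
# ensemble so far, so training error falls as stages accumulate.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Training MSE after each of the 100 boosting stages.
errors = [np.mean((y - pred) ** 2) for pred in model.staged_predict(X)]
print(f"MSE stage 1: {errors[0]:.4f} -> stage 100: {errors[-1]:.4f}")
```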

4.1.5 Extreme Gradient Boosting

XGB represents an enhanced iteration of the GB algorithm, originally formulated by Friedman [31]. The fundamental principle underlying gradient tree boosting involves sequentially combining weak base learning trees, characterized by high error, into a more robust learning model. XGB builds upon this foundation by introducing a regularization component to the loss function, a crucial addition for assessing the model’s complexity and improving performance. This regularization aspect serves to optimize the learning model’s parameters and counteract overfitting, contributing to a more balanced and accurate model. At its core, XGB tackles optimization challenges related to the objective function, employing a gradient-enhanced framework that facilitates efficient handling of diverse data science problems. The parallel boosting trees further enhance XGB’s capability to swiftly and accurately address a wide range of data science challenges.

4.1.6 Light Gradient Boosting Machine

The LightGB stands as an open-source toolkit that presents an efficient implementation of a GB framework rooted in tree-based learning methods [38]. Developed by Microsoft, the LightGB algorithm is characterized by notable advantages, including accelerated training speed, high accuracy, dependability, and minimal memory usage during execution. LightGB excels in efficiently handling large-scale data, particularly in regression scenarios. Its innovative approach has garnered attention, and the algorithm is easily deployable using the Python language, with a comprehensive parameter list available in the LightGB guide [39]. This makes LightGB a versatile and accessible tool for practitioners seeking a robust and efficient GB framework for their ML tasks.

4.1.7 Decision Tree

DT, introduced by Quinlan [40], stands as a supervised ML model designed to address classification or regression challenges. DT constructs a predictive model by employing data-driven decision rules to estimate the target variable. Comprising root, internal, and leaf nodes, along with branches, a DT’s structure guides its decision-making process. The root node initializes with the value of the first branching variable, while internal nodes encapsulate variables representing characteristics used to evaluate subsequent branches. Leaf nodes store the value of the corresponding category variable, and branches embody the rules governing the relationship between independent and target variable values. This hierarchical structure empowers DT to effectively model and infer complex relationships within data sets.
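The root/branch/leaf structure described above is easiest to see on a tiny example: a depth-1 tree learns a single decision rule, which `export_text` prints in readable form. A toy sketch, not the study's data:

```python
# Decision Tree sketch: one learned split separating two target groups.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(export_text(tree))               # the single learned decision rule
print(tree.predict([[2.5], [11.5]]))   # -> [1. 5.]
```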

4.1.8 K-Nearest Neighbors

The KNN algorithm constitutes a supervised learning method in ML [41,42]. This method predicts the output associated with a new input, denoted as x, by considering the k training samples whose inputs closely resemble x based on a distance metric to be determined. The effectiveness of this technique can be enhanced through normalization, particularly since it relies on distance computations. Operating as a nonparametric approach, the KNN algorithm serves purposes in classification or regression. In classification, the task involves assigning the item to the category to which the majority of the k nearest neighbors belong in the space of identified features. For regression, the KNN technique provides the value for the item, determined as the average of the k nearest neighbors’ values. This approach, rooted in weak learning, evaluates functions locally during the learning phase. Despite its simplicity, the KNN method stands as one of the foundational algorithms in ML.
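As an illustration of the prediction rule just described, KNN regression averages the targets of the k closest training samples. A pure-Python sketch (not the implementation used in this study):

```python
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a regression target as the mean of the k nearest
    training samples under the Euclidean distance."""
    dists = [(math.dist(x, x_new), y) for x, y in zip(X_train, y_train)]
    dists.sort(key=lambda pair: pair[0])
    neighbors = [y for _, y in dists[:k]]
    return sum(neighbors) / k

# Toy 1-D example: the two nearest neighbors of x = 2.5 are x = 2 and x = 3
X = [[1.0], [2.0], [3.0], [10.0]]
y = [2.0, 4.0, 6.0, 20.0]
print(knn_predict(X, y, [2.5], k=2))  # (4.0 + 6.0) / 2 = 5.0
```

Because the rule relies on distance computations, normalizing the input variables to a common scale would typically improve it, as noted above.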

4.1.9 Multivariate Adaptive Regression Splines

Designed for addressing multivariate nonlinear regression challenges, MARS provides a targeted solution [43]. Tailored for regression problems, MARS accommodates multiple input variables (often tens) and operates in a nonlinear space, indicating the absence of a singular straight line to represent the relationship between inputs and the target variable. To make predictions, MARS seeks a set of simple piecewise linear functions that effectively capture the data’s characteristics, essentially forming a collection of linear functions. Each of these functions, termed basis functions, is produced in abundance by the MARS algorithm. The subsequent training involves constructing a linear regression model using the output of these basis functions and the target variable, with each basis function’s output weighted by a coefficient. This implies that the forecast is generated by summing the weighted outputs of all basis functions. As the model evolves, additional basis functions are generated, and more data are incorporated to refine the predictive capabilities. The iterative nature of MARS ensures adaptability and effectiveness in capturing the intricacies of nonlinear relationships within multivariate regression scenarios.
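The piecewise-linear basis functions described above are hinge functions, and a fitted MARS model is a coefficient-weighted sum of them. A minimal sketch (the knot and coefficients below are hypothetical, not fitted values):

```python
def hinge(x, knot, direction=+1):
    """MARS basis function: max(0, x - knot) if direction = +1,
    max(0, knot - x) if direction = -1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Sum the coefficient-weighted outputs of all basis functions,
    as described above. `terms` holds (coefficient, knot, direction)."""
    return intercept + sum(c * hinge(x, t, d) for c, t, d in terms)

# Hypothetical model: flat at 0.5 up to the knot x = 2, slope 1.5 beyond it
model = [(1.5, 2.0, +1)]
print(mars_predict(1.0, 0.5, model))  # below the knot: 0.5
print(mars_predict(4.0, 0.5, model))  # 0.5 + 1.5 * (4 - 2) = 3.5
```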

4.2 Performance metrics for evaluating Machine Learning models

Evaluating an ML model involves considering performance metrics, which play a crucial role in assessing the effectiveness of the model. Two commonly utilized criteria for this purpose are R2 and RMSE. R2 is a standard measure, especially in the context of regression analysis with ML models, providing insight into the variation between true and predicted data. RMSE, on the other hand, quantifies the mean magnitude of errors and is particularly valuable in the presence of significant discrepancies. The assessment of a model’s performance often relies on these metrics, which are calculated as follows:
$R^2 = 1 - \dfrac{\sum_{i=1}^{n}\left(y_i - y_i^{pre}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$,
$RMSE = \sqrt{\dfrac{\sum_{i=1}^{n}\left(y_i - y_i^{pre}\right)^2}{n}}$,
where yi represents the true data, yipre is the predicted data, y¯ is the mean of true data, and n is the number of data samples.
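The two metrics can be implemented directly from the equations above; a pure-Python sketch:

```python
import math

def r2_score(y_true, y_pred):
    """R^2 = 1 - sum((y - y_pred)^2) / sum((y - mean(y))^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """RMSE = sqrt(mean((y - y_pred)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical true and predicted surface chloride concentrations (% wt.)
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]
print(round(r2_score(y_true, y_pred), 3))  # 0.95
print(round(rmse(y_true, y_pred), 3))      # 0.05
```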

4.3 Repeated K-Fold Cross-Validation

The widely adopted K-Fold CV technique assesses ML models on data sets [44]. During each iteration, one of the K-Folds serves as a test set, while the remaining folds collectively form the training data set. The process repeats until each fold has been utilized as a test set. The toolkit, scikit-learn, facilitates the implementation of the K-Fold CV method.
However, K-Fold CV performance estimates can be noisy because they depend on how the data set happens to be split, which affects both the individual scores and their mean. The choice of model and data set influences the variability of the predicted performance across different K-Fold CV runs, making a stable estimate of model performance difficult to obtain. One way to mitigate this noise is to increase the value of k, which reduces the bias of the performance estimate at the cost of higher variance. Another strategy is repeated K-Fold CV, which runs the K-Fold CV process multiple times and reports the mean performance. This study opts for 10 repeats of 10-Fold CV, as suggested by Brownlee [45] and illustrated in Fig.4. This approach provides a more robust assessment of model performance by capturing the variability across multiple repetitions of the K-Fold CV procedure.
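The study uses the scikit-learn implementation; purely to illustrate the mechanics, the index generation can be sketched in plain Python as follows (each repeat reshuffles the data, then every fold serves once as the test set):

```python
import random

def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated K-Fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Partition the shuffled indices into n_splits folds
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for f, fold in enumerate(folds) if f != k for i in fold]
            yield train, test

# 386 samples, as in this study's database
splits = list(repeated_kfold_indices(386, n_splits=10, n_repeats=10))
print(len(splits))  # 100 train/test pairs for 10 repeats of 10-Fold CV
```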
Fig.4 Schema of 10-Fold cross validation and 10 repeats of 10-Fold cross validation.


4.4 Input variable effect using Shapley Additive Explanations and Partial Dependence Plot

The SHAP technique is grounded in game theory concepts, serving as a foundation for explaining predictions [30]. In this framework, each input variable value is likened to a player in a game. Contributions are determined by iteratively including and excluding each player from all subsets of the other players. The SHAP value assigned to a player is then derived as the sum of these contributions. This concept combines local accuracy and additivity: when the SHAP values are added to the base value, which represents the average prediction, the model’s prediction is recovered.
Molnar [46] elucidates the SHAP technique, which explains a specific prediction (for an instance x) by estimating the contribution of each input variable. The SHAP values are computed through the SHAP explanation approach, which is rooted in coalitional game theory. In this context, the feature values of a data instance act as players in a coalition.
One distinctive aspect introduced by SHAP is the Shapley value explanation, which is depicted as an additive feature attribution approach that resembles a linear model. According to SHAP’s justification, this approach offers a nuanced perspective on understanding the contribution of individual features to predictions. This novel methodology addresses the interpretability of complex models, providing invaluable insights into how each feature influences the overall prediction.
The SHAP technique presents a significant advancement in model interpretability, enabling a more comprehensive understanding of the complex relationships between input variables and the target prediction. By employing the Shapley value explanation, SHAP offers a systematic and principled way to open the black box of advanced ML models, enhancing the transparency and trustworthiness of these predictive tools. In essence, SHAP leverages game theory principles to assign each feature a value based on its contribution, and the resulting Shapley values distribute the prediction fairly among the features, yielding a more transparent and interpretable understanding of ML model predictions:
$f(z) = \phi_0 + \sum_{j=1}^{M} \phi_j z_j$.
In the SHAP paradigm, the explanation model $f$ integrates a coalition vector $z \in \{0,1\}^M$, where $\phi_j \in \mathbb{R}$ represents the attribution of input variable $j$ in the SHAP value calculation [30]. The coalition vector $z$ outlines which players (feature values) are present in the coalition, and $M$ denotes the maximum size of this coalition. The attribution $\phi_j$ is pivotal in determining the impact of individual features on the overall prediction, and together with the coalition vector and the maximum coalition size $M$ it enables a detailed exploration of feature attributions within coalitional game theory.
In real-world scenarios, when the model is nonlinear or the input features are not independent, the SHAP value must be calculated as a weighted average across all possible feature orderings. SHAP obtains the attribution value $\phi_j$ for each feature by combining these conditional expectations with the classical Shapley values from game theory, using the following equation.
$\phi_j = \sum_{S \subseteq \{x_1,\dots,x_p\} \setminus \{x_j\}} \dfrac{|S|!\,(p-|S|-1)!}{p!} \left( f_x\left(S \cup \{x_j\}\right) - f_x(S) \right)$,
where $\{x_1,\dots,x_p\}$ is the set of all input features, $p$ is the number of input features, $\{x_1,\dots,x_p\} \setminus \{x_j\}$ denotes the possible subsets of input features excluding $x_j$, and $f_x(S)$ is the prediction based on the feature subset $S$.
The weight $\frac{|S|!\,(p-|S|-1)!}{p!}$ can be understood as follows.
Denominator: there are $p!$ orderings of the $p$ features.
Numerator: once the subset $S$ is fixed, the orderings consistent with it take the form $\{x_1,\dots,x_{|S|}, x_j, x_{|S|+2},\dots,x_p\}$: the $|S|$ features of $S$ come first in any of their $|S|!$ orders, followed by feature $j$, followed by the remaining $p-|S|-1$ features in any of their $(p-|S|-1)!$ orders, giving $|S|!\,(p-|S|-1)!$ orderings in total.
Hence $\frac{|S|!\,(p-|S|-1)!}{p!}$ is the proportion of feature orderings consistent with subset $S$, and the sum of these proportions over all possible subsets $S$ equals 1.
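This sum-to-one property can be verified numerically for any number of features p; a short pure-Python check over all subsets:

```python
from itertools import combinations
from math import factorial

def shapley_weight(s_size, p):
    """Weight |S|! (p - |S| - 1)! / p! from the Shapley value formula."""
    return factorial(s_size) * factorial(p - s_size - 1) / factorial(p)

def total_weight(p):
    """Sum the weight over every subset S of the p - 1 remaining features."""
    others = range(p - 1)  # all features except feature j
    return sum(shapley_weight(len(S), p)
               for r in range(p) for S in combinations(others, r))

print(total_weight(5))   # the proportions sum to 1, as stated above
print(total_weight(12))  # also holds for the 12 input variables of this study
```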
This holistic understanding contributes to the transparency and interpretability of the explanation model, enabling users to comprehend how individual features collectively shape ML predictions.

5 Results and discussion

5.1 Performance analysis of Machine Learning model

In this section, the performance of ML models is compared based on 10 repetitions of 10-Fold CV. Fig.5 presents a boxplot illustrating the R2 values from 10 repetitions of 10-Fold CV for nine ML models, including SVR, AdaBoost, RF, XGB, GB, LightGB, DT, KNN, and MARS. This boxplot displays the minimum, maximum, and median values of the R2 scores for each repeat. The results indicate that the SVR model exhibits the lowest performance, with significant variability in the R2 values centered around a median of approximately 0.4. In contrast, the GB model demonstrates the highest performance, with minimal variation in R2 values around a median of approximately 0.8. The MARS model performs better than the SVR model but is outperformed by the KNN model, with median R2 values of approximately 0.55 for MARS and 0.60 for KNN. The R2 value variation in each repetition for both KNN and MARS models appears to be comparable.
Fig.5 Boxplot of 10 repeats of 10-Fold CV for R2 value of 9 ML models: (a) SVR; (b) AdaBoost; (c) RF; (d) XGB; (e) GB; (f) LightGB; (g) DT; (h) KNN; (i) MARS.


The remaining 5 models, including XGB, LightGB, DT, AdaBoost, and RF, appear to exhibit comparable performance, making it challenging to distinguish them. This is evident as the median R2 values of these 5 models are quite similar, hovering around 0.8, and their variation ranges appear relatively indistinguishable when represented on a boxplot.
The average value for each repeat and the StD for each repeat are compiled in Tab.2 and Tab.3, respectively. This allows for a relative quantification of the performance of each ML model in predicting the surface chloride concentration of concrete exposed to a tidal zone.
Tab.2 Value of R2 in each time of 10 repeats of 10-Fold CV for 9 ML models
Repeat   SVR     AdaBoost   RF      XGB     GB      LightGB   DT      KNN     MARS
1        0.390   0.758      0.688   0.749   0.785   0.769     0.717   0.591   0.535
2        0.379   0.761      0.665   0.742   0.779   0.778     0.708   0.594   0.525
3        0.379   0.742      0.672   0.727   0.773   0.768     0.693   0.590   0.507
4        0.394   0.736      0.662   0.722   0.763   0.764     0.684   0.587   0.519
5        0.383   0.759      0.682   0.739   0.780   0.773     0.706   0.587   0.516
6        0.373   0.754      0.679   0.737   0.777   0.769     0.706   0.582   0.531
7        0.384   0.744      0.674   0.726   0.776   0.767     0.690   0.591   0.526
8        0.385   0.753      0.670   0.739   0.776   0.764     0.705   0.574   0.526
9        0.388   0.751      0.676   0.730   0.775   0.768     0.693   0.579   0.520
10       0.379   0.748      0.665   0.732   0.773   0.763     0.696   0.589   0.528
Tab.3 Value of StD in each time of 10 repeats of 10-Fold CV for 9 ML models
Repeat   SVR     AdaBoost   RF      XGB     GB      LightGB   DT      KNN     MARS
1        0.048   0.027      0.024   0.026   0.024   0.021     0.030   0.040   0.040
2        0.040   0.034      0.030   0.038   0.031   0.027     0.040   0.027   0.022
3        0.043   0.019      0.018   0.019   0.015   0.014     0.021   0.024   0.023
4        0.033   0.020      0.019   0.021   0.017   0.016     0.021   0.022   0.024
5        0.025   0.013      0.013   0.015   0.013   0.011     0.015   0.020   0.019
6        0.022   0.015      0.015   0.016   0.013   0.012     0.017   0.018   0.016
7        0.023   0.014      0.011   0.016   0.013   0.012     0.017   0.017   0.015
8        0.023   0.013      0.012   0.014   0.012   0.011     0.015   0.017   0.016
9        0.020   0.012      0.012   0.013   0.011   0.010     0.014   0.016   0.014
10       0.019   0.012      0.012   0.013   0.011   0.011     0.013   0.016   0.013
To make a specific comparison of the performance of the nine ML models, the mean and StD of 10 iterations of 10-Fold CV are gathered in Tab.4, along with the rank of model performance. The ranking of ML model performance is arranged from low to high based on the average R2 value. The results in this table help clearly rank the model accuracy in the order of SVR < MARS < KNN < RF < DT < XGB < AdaBoost < LightGB < GB. However, the reliability of the LightGB model appears slightly better than the GB model, as the average StD value of the LightGB model is the lowest at 0.015, followed closely by the GB model at 0.016.
Tab.4 Mean value of R2 and Mean value of StD of 10 repeats of 10-Fold CV for 9 ML models
Model      Mean value of R2   Rank      Model      Mean value of StD   Rank
SVR        0.383              9         SVR        0.030               9
MARS       0.523              8         KNN        0.022               8
KNN        0.586              7         DT         0.020               7
RF         0.673              6         MARS       0.020               6
DT         0.700              5         XGB        0.019               5
XGB        0.734              4         AdaBoost   0.018               4
AdaBoost   0.751              3         RF         0.017               3
LightGB    0.768              2         GB         0.016               2
GB         0.776              1         LightGB    0.015               1
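The mean values reported in Tab.4 follow directly from the columns of Tab.2; for example, for the SVR and GB columns (values copied from Tab.2):

```python
# R^2 scores of the SVR and GB columns of Tab.2 (10 repeats of 10-Fold CV)
svr = [0.390, 0.379, 0.379, 0.394, 0.383, 0.373, 0.384, 0.385, 0.388, 0.379]
gb = [0.785, 0.779, 0.773, 0.763, 0.780, 0.777, 0.776, 0.776, 0.775, 0.773]

mean_svr = round(sum(svr) / len(svr), 3)
mean_gb = round(sum(gb) / len(gb), 3)
print(mean_svr, mean_gb)  # 0.383 0.776, matching Tab.4
```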
Among the nine ML models, four models with average R2 values for 10 repeats of 10-Fold CV ranging from 0.734 to 0.776 are retained for comparing accuracy in predicting the surface chloride concentration on the same training and validation data set in the next section.
The results of 10 repeats of 10-Fold CV demonstrate that the GB model exhibits the highest accuracy in predicting surface chloride concentration among the ML models in this study. Each ML model nevertheless has its own advantages for specific tasks. For example, Lin et al. [47] evaluated eight ML models, including AdaBoost, GB, bagging, extra trees, RF, hist GB, voting, and stacking, for slope stability prediction, and found that the voting and stacking models performed best. Also on slope stability prediction, Lin et al. [48] evaluated seven state-of-the-art ensemble models alongside several other ML models and demonstrated that the CatBoost model had the best predictive performance. In another study by Lin et al. [49], which combined a variational autoencoder to address imbalanced data, the GB model exhibited the best performance in rock burst assessment among six state-of-the-art ensemble models, including GB, and the classical logistic regression model. The power of GB lies in its ability to sequentially train weak learners and combine them into a strong predictive model. Through iterative refinement, the GB model minimizes errors by focusing on instances where previous models performed poorly, resulting in highly accurate predictions.
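The sequential residual-fitting mechanism just described can be illustrated with a toy booster on one-dimensional decision stumps (a sketch of the principle only; the study itself relies on a library GB implementation with default hyperparameters):

```python
def fit_stump(x, residual):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def fit_gb(x, y, n_stages=50, lr=0.1):
    """Each stage fits a stump to the current residuals, so later
    learners concentrate on the poorly predicted samples."""
    f0 = sum(y) / len(y)
    stumps, pred = [], [f0] * len(x)
    for _ in range(n_stages):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + lr * sum(s(xi) for s in stumps)

# Hypothetical step-like 1-D target
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.1, 0.2, 0.2, 0.3, 0.8, 0.9, 1.0, 0.9]
model = fit_gb(x, y)
print([round(model(xi), 2) for xi in x])
```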

5.2 Prediction of surface chloride concentration using four best Machine Learning models

The surface chloride concentration on the same testing data set is predicted using the four ML models: AdaBoost, XGB, LightGB, and GB. The predicted results are then compared to the actual values. Fig.6 illustrates this comparison by simultaneously displaying the prediction results on both the training and testing sets. Two performance metrics—R2 and RMSE—are employed to quantify the prediction model’s accuracy.
Fig.6 Comparison of surface chloride concentration predicted by 4 ML models: (a) AdaBoost; (b) XGB; (c) LightGB; (d) GB.


It is evident that the XGB model exhibits very high forecasting accuracy on the training set with R2 = 0.865 and RMSE = 0.136. Conversely, the AdaBoost model has the lowest predictive accuracy on the training set with R2 = 0.713 and RMSE = 0.199. The LightGB and GB models show relatively similar predictive accuracy on the training set, with R2 = 0.823 and RMSE = 0.156 for LightGB and R2 = 0.842 and RMSE = 0.148 for GB. These results suggest that the accuracy of predicting the surface chloride concentration of concrete by ML models ranks in the order of AdaBoost < LightGB < GB < XGB, which seems contrary to the findings in the previous research section.
However, it is essential to use the accuracy of the ML models on the testing data set as the basis for comparison. Fig.6 demonstrates that the GB model has the highest forecasting accuracy for the testing data set, with R2 = 0.920 and RMSE = 0.103. Following GB, the LightGB model has R2 = 0.911 and RMSE = 0.109, and XGB has the third-highest predictive accuracy, with R2 = 0.899 and RMSE = 0.116. AdaBoost has the lowest accuracy for the testing data set, with R2 = 0.808 and RMSE = 0.159.
The accuracy of an ML model depends on the quality and quantity of the data. Data quality here refers to the statistical distribution of the data values; a distribution close to normal contributes to data quality. Randomly splitting the data at a 70/30 ratio into training and testing sets affects the R2 value obtained on each set, and every split yields different R2 values. The Repeated K-Fold CV technique is therefore used to establish the reliability of the ML models, with the average result representing a model’s performance. The testing-set R2 exceeding the training-set R2 here simply reflects one favorable split among many possible ones. This result confirms that the accuracy of predicting the surface chloride concentration of concrete ranks, in descending order, GB > LightGB > XGB > AdaBoost, which aligns with the analysis in Subsection 5.1. Therefore, the LightGB and GB models are retained for further feature importance analysis using SHAP and PDP in the next section.
Additionally, to assist non-IT engineers in directly applying the GB ML model to predict the surface chloride concentration of concrete exposed to a tidal zone, an Excel file generated from the GB model is provided for direct prediction using the proposed 12 input variables. The Excel file can be found in the Supplementary file.

5.3 Identification of importance feature on surface chloride concentration

5.3.1 Shapley Additive Explanations of predicted surface chloride concentration

This section utilizes the LightGB and GB ML models, which have demonstrated superior performance, to analyze the importance of features in predicting the surface chloride concentration of concrete. This analysis is conducted using SHAP values, with the purpose of evaluating the impact of the 12 input factors on the surface chloride concentration value both qualitatively and quantitatively. Fig.7 displays the SHAP values and mean absolute SHAP values for the two ML models, LightGB (Fig.7(a)) and GB (Fig.7(b)). Each histogram illustrates the impact of the variables based on the mean absolute SHAP value (right side of Fig.7) and quantifies the relative influence of these variables on the output surface chloride concentration value based on the SHAP value (left side of Fig.7).
Fig.7 Qualitative and quantitative effects of 12 input variables on surface chloride concentration of concrete: (a) SHAP value based on LightGB model; (b) SHAP value based on GB model.


Feature importance analyses such as SHAP values can confirm the robustness and accuracy of importance assessments when models built on different algorithms produce similar results. The concurrent application of SHAP values derived from the GB and LightGB models confirms the interpretability of the predicted surface chloride concentration values, thereby enabling a precise comprehension of the mechanism of feature influence on surface chloride concentration. Tab.5 shows the effect of three sets of input variables (12, 9, and 6 variables) on the performance of the GB model, where the selection of input variables to retain is based on the SHAP interpretation of the predicted results (Fig.7). Using 9 input variables (fine aggregate content, exposure time, annual mean temperature, coarse aggregate, FA, water/binder, SF, chloride concentration, and cement), i.e., excluding the 3 variables water, GGBFS, and superplasticizer, does not greatly affect the performance of the GB model. However, if the 6 variables SF, chloride concentration, cement, water, GGBFS, and superplasticizer are removed, the performance of the GB model in predicting surface chloride concentration drops sharply compared with the GB performance using all 12 input variables.
Tab.5 Performance of GB model with using different input variables
Number of input variables used | Input variables excluded | Training data set | Testing data set
12: fine aggregate content, exposure time, annual mean temperature, coarse aggregate, FA, water/binder, SF, chloride concentration, cement, water, GGBFS, superplasticizer | none | R2 = 0.842, RMSE = 0.148 | R2 = 0.920, RMSE = 0.103
9: fine aggregate content, exposure time, annual mean temperature, coarse aggregate, FA, water/binder, SF, chloride concentration, cement | 3: water, GGBFS, superplasticizer | R2 = 0.841, RMSE = 0.149 | R2 = 0.919, RMSE = 0.104
6: fine aggregate content, exposure time, annual mean temperature, coarse aggregate, FA, water/binder | 6: SF, chloride concentration, cement, water, GGBFS, superplasticizer | R2 = 0.825, RMSE = 0.156 | R2 = 0.878, RMSE = 0.127
It is important to observe that the SHAP values derived from the LightGB and GB models indicate the impact of two groups of input factors on surface chloride concentration. A group of four factors, arranged in decreasing order of influence, has the greatest impact on surface chloride concentration: Fine aggregate content > Exposure time > Annual mean temperature > Coarse aggregate content. Surface chloride concentration values are less influenced by the remaining group: water/binder ratio, SF content, cement content, FA content, water content, superplasticizer content, and chloride concentration of seawater.
Specifically, GGBFS content and superplasticizer content appear to have an insignificant effect on surface chloride concentration. Superplasticizer helps adjust the workability of concrete mixture without significantly affecting other mechanical properties of concrete. The relatively small number of samples containing GGBFS (30/386 samples) may explain why GGBFS content does not seem to impact surface chloride concentration, reflecting the nature of the small sample data rather than the practical influence of GGBFS on surface chloride concentration. Additionally, cement content and water content seem to have minimal effects on surface chloride concentration.
Although chloride concentration of seawater has a small effect on surface chloride concentration compared to the group of four factors with the greatest influence, the SHAP values indicate that an increase in chloride concentration of seawater also results in a relative increase in surface chloride concentration. With a larger data set, this result strengthens the findings of Shakouri and Trejo [12,50], suggesting that surface chloride concentration increases nonlinearly with an increase in the concentration of chlorides in the environment.
Both sets of SHAP values, based on the LightGB and GB models, generally show that increasing SF content, FA content, and coarse aggregate content leads to an increase in surface chloride concentration. In concrete design, fine aggregate content is the only ingredient whose increase reduces surface chloride concentration; at the same time, it is the most critical factor affecting the surface chloride concentration of concrete. Among the four most influential factors, the exposure conditions, namely exposure time and annual mean temperature, positively influence surface chloride concentration, indicating that an increase in these two factors induces higher surface chloride concentration values.
To provide a more specific evaluation of the trend of influence of variables, including the four most important factors, water/binder ratio, and supplementary cementitious materials such as FA and SF content on the surface chloride concentration value, the results of PDP 1D using the GB ML model will be examined in the next section.

5.3.2 Quantitative factor effect of input variables by Partial Dependence Plot

In this section, the influence of 8 important factors on surface chloride concentration out of the 12 input variables will be quantitatively analyzed using PDP 1D. Simultaneously, the results of the PDP will be compared with simple correlation, revealing the superiority of the correlation study using the PDP technique over Pearson correlation. Thus, Fig.8 presents the Pearson correlation of variables with true surface chloride concentration and 1D PDP between input variables and predicted surface chloride concentration.
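A PDP 1D curve is obtained by fixing the variable of interest at each grid value, averaging the model’s predictions over all samples, and plotting the averages against the grid. A minimal sketch (the linear model and the numbers here are hypothetical stand-ins, not the fitted GB model):

```python
def pdp_1d(model, rows, feature_idx, grid):
    """For each grid value v, set feature `feature_idx` to v in every row,
    average the model predictions, and collect the averages."""
    curve = []
    for v in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_idx] = v
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve

def model(r):
    # Hypothetical stand-in: r[0] = fine aggregate content (kg/m3),
    # r[1] = annual mean temperature (degrees C)
    return 0.4 - 0.0005 * r[0] + 0.002 * r[1]

rows = [[600, 15], [650, 20], [700, 25], [750, 30]]
curve = pdp_1d(model, rows, 0, [550, 650, 750])
print([round(c, 3) for c in curve])  # decreasing: more fine aggregate, lower prediction
```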
Fig.8 Effect of factors on surface chloride concentration of concrete: (a) fine aggregate content and surface chloride described by simple correlation; (b) fine aggregate content and surface chloride described by PDP 1D; (c) exposure time and surface chloride described by simple correlation; (d) exposure time and surface chloride described by PDP 1D; (e) annual mean temperature and surface chloride described by simple correlation; (f) annual mean temperature and surface chloride described by PDP 1D; (g) coarse aggregate content and surface chloride described by simple correlation; (h) coarse aggregate content and surface chloride described by PDP 1D; (i) water/binder and surface chloride described by simple correlation; (j) water/binder and surface chloride described by PDP 1D; (k) SF content and surface chloride described by simple correlation; (l) SF content and surface chloride described by PDP 1D; (m) FA content and surface chloride described by simple correlation; (n) FA content and surface chloride described by PDP 1D.


Fig.8(b) illustrates that increasing the use of fine aggregate content allows for a reduction in surface chloride concentration, especially from around 552 to 711 kg/m3. Beyond this level, the trend remains relatively stable at the value of 0.4% weight of concrete, as evidenced by the actual data in Fig.8(a), confirming the negative influence of fine aggregate content on surface chloride concentration with a Pearson correlation coefficient of −0.41 (Fig.3).
The Pearson correlation coefficient in Fig.8(c) is −0.12, indicating that the influence trend of exposure time on surface chloride concentration seems to be negative. Increasing exposure time induces a decrease in surface chloride concentration, which appears to contradict the conclusion of Shakouri and Trejo [12], who found that surface chloride concentration increases nonlinearly with exposure time. In reality, Fig.8(d), with the PDP analysis results, shows a linear increase in surface chloride concentration with exposure time, trending up by 0.4% weight of concrete after 5 years of exposure. However, after this period, the surface chloride concentration shows signs of being relatively stable, with almost very small changes. This result holds significant implications for assessing exposure time’s impact on surface chloride concentration when utilizing a relatively large sample size.
Determining the influence of annual mean temperature and coarse aggregate content on surface chloride concentration is challenging to illustrate clearly using only the Pearson correlation coefficient of true surface chloride concentration values (Fig.8(e) and Fig.8(g)). PDP values show that, in general, annual mean temperature and coarse aggregate content tend to increase surface chloride concentration. However, this trend is relatively weak compared with the influence of fine aggregate content and exposure time, with PDP values reaching around 0.10% weight of concrete for annual mean temperature and approximately 0.18% weight of concrete for coarse aggregate content. Moreover, the majority of the influence trend of coarse aggregate content lies at the value line of 0.00%, indicating that, overall, the impact of these two variables is comparatively small.
The influence trend of water/binder ratio on surface chloride concentration is depicted by the PDP value as a nonlinear trend, resembling a parabolic curve. Increasing water/binder ratio decreases the surface chloride concentration, with the most significant decrease being approximately 0.05% weight of concrete when water/binder = 0.45. The trend starts to rise again after this ratio (Fig.8(j)). However, it’s noteworthy that the maximum decrease of 0.05% weight of concrete is minimal compared to the most significant influence values caused by fine aggregate content or exposure time (0.4% weight of concrete). Therefore, Shakouri and Trejo’s [12] conclusion that the W/C ratio does not affect surface chloride concentration seems reasonable when comparing the influence trend of water/binder with other analyzed factors.
Fig.8(k) and Fig.8(m) show the distribution of SF content and FA content against true surface chloride concentration, with Pearson correlation coefficients in Fig.3 of 0.08 for SF content vs. surface chloride concentration and 0.33 for FA content vs. surface chloride concentration. For the use of supplementary cementitious materials in concrete durability design, this result suggests that FA content is more strongly associated with surface chloride concentration than SF content. PDP values show that, in general, using a certain amount of SF or FA tends to increase surface chloride concentration. For instance, as SF content rises from 0 to 20 kg/m3, surface chloride concentration increases from 0 to 0.17% weight of concrete; for SF contents from 20 up to 32 kg/m3, surface chloride concentration starts to decrease, and SF appears to have no impact on surface chloride concentration beyond 32 kg/m3 (Fig.8(l)). Similarly, increasing FA content from zero up to a certain content increases surface chloride concentration from 0 to 0.07% weight of concrete, after which surface chloride concentration starts to decrease (Fig.8(n)). Therefore, it can be concluded that using a lower SF content than FA content is more effective, aligning with experiments on concrete structures exposed to the tidal zone for 16 years [1].

6 Conclusions and perspectives

In this study, we employed nine ML algorithms with default hyperparameters sourced from the Python library to predict the surface chloride concentration of concrete exposed to tidal environments. The selected algorithms include SVR, AdaBoost, RF, XGB, GB, LightGB, DT, KNN, and MARS.
The database consists of 386 samples collected from tidal environments. It includes 12 input variables: cement content, FA content, GGBFS content, SF, superplasticizer content, water content, fine aggregate content, coarse aggregate content, water/binder ratio, exposure time, annual mean temperature of the environment, and chloride concentration of seawater. The performance of each ML model is evaluated using the R2 and RMSE, in addition to the reliability assessment technique of 10 repeats of 10-Fold CV.
The performance ranking of ML models is as follows: SVR < MARS < KNN < RF < DT < XGB < AdaBoost < LightGB < GB. The four selected models (AdaBoost, XGB, LightGB, and GB) are employed to predict surface chloride concentration. Among these, the GB ML model, utilizing default hyperparameters, demonstrates superior performance, achieving an R2 of 0.920 and RMSE of 0.103% weight of concrete for the testing data set. Moreover, an Excel file for estimating surface chloride concentration is generated from the GB model, providing a convenient tool for mix design based on concrete durability requirements.
SHAP values and PDP 1D analyses offer a comprehensive evaluation of the impact of the 12 variables on surface chloride concentration. The group with the most substantial influence on surface chloride concentration comprises four factors, arranged in decreasing order of impact: Fine aggregate content > Exposure time > Annual mean temperature > Coarse aggregate content. Other groups, including water/binder ratio, SF content, cement content, FA content, water content, superplasticizer content, and chloride concentration of seawater, exert less influence on surface chloride concentration values. Specifically, a linear increase in surface chloride concentration is observed with prolonged exposure time, followed by a stable trend after a certain duration. Notably, an increase in fine aggregate content contributes to a reduction in surface chloride concentration. The Excel file for estimating surface chloride concentration facilitates the exploration of the influence of important variables, encouraging further studies.
Nevertheless, it should be noted that the ML model identified in this study, together with the accompanying Excel file, should only be applied within the range of values investigated for each input variable. Moreover, the findings are applicable only to RC structures exposed to the tidal zone of seawater. In future work, ML models should be developed to predict the surface chloride concentration of RC exposed to the splash and submerged zones of seawater. Furthermore, the performance of the ML model could be improved by enhancing the quality of the database, i.e., by increasing the number of samples and reducing the amount of imbalanced data.

Competing interests

The authors declare that they have no competing interests.
