1 Introduction
Overconsolidation ratio (OCR) is a principal geotechnical parameter that reflects the stress history of a soil and is an important index of soil stability and deformation characteristics. OCR can be measured by the laboratory consolidation test. However, this test is time-consuming and expensive, especially when a continuous profile of OCR is required [1].
Cone penetration testing with pore pressure measurement (CPTU, or piezocone testing) is a rapid and economical in situ testing method that provides continuous, accurate profiles and allows a range of geotechnical properties, including OCR, to be determined. During CPTU, cone tip resistance, sleeve friction, and excess pore water pressure are measured simultaneously. Based on these measurements, a number of analytical [2–8] and empirical [5,8–14] relationships have been proposed to estimate the OCR of clayey soils. However, since OCR is a state index of the soil while the CPTU measurements reflect its mechanical response, the relationship between the two is complex and nonlinear. Therefore, the applicability of these relationships is limited to the regions where their correlations were established [15].
Machine learning (ML), a rapidly evolving area of artificial intelligence, focuses on developing algorithms that find complex patterns in data and apply these patterns to generate informed forecasts or decisions. Recent ML models have shown extraordinary promise in analyzing structured, tabular data sets, where relationships between variables can be difficult to capture using conventional statistical techniques. These models are especially useful for addressing data-driven challenges in many scientific and engineering fields, since they can uncover hidden patterns and trends by learning from existing data [16,17]. ML methods have also been successfully utilized to predict various geotechnical properties [18–20]. In particular, OCR has been predicted using ML approaches in recent years [1,21–23]. Tab.1 presents a summary of previous studies on the prediction of the OCR of clays using ML methods.
This research aims to examine and compare the performance of five well-known ML methods, i.e., gradient boosting machine (GBM), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and eXtreme gradient boosting (XGB), in predicting the OCR values of clays. This is the first time that GBM, RF, and XGB models have been used to predict clay OCR. Moreover, cutting-edge models were developed by integrating hyper-parameter tuning (optimization) coupled with k-fold cross-validation into the ML algorithms. A large set of records collected from around the world, including data from both intact and fissured clay sites, was used to train and test the developed models. Another key contribution of this work is that, through an encoding procedure, clay type was integrated as an input parameter, further enhancing the models' predictive capabilities. Furthermore, this study examined the influence of different CPTU configurations, i.e., single-element, double-element, and triple-element, on the performance of the ML models, thereby providing a comprehensive evaluation of their effectiveness under varying conditions. After selecting the proper input features, the performance of each model was examined by comparing the predicted OCR values with their reference counterparts obtained from the oedometer test, in terms of different metric measures. Finally, the OCR values predicted by the ML models were compared with those obtained from existing relationships. By addressing these aspects, our research advances the state of the art in geotechnical engineering and offers a reliable, fast, and cost-effective solution for predicting OCR.
2 Data acquisition
Chen [24] gathered a large set of CPTU data and their counterpart OCR values measured by the oedometer test from the literature. After removing outliers and NaN records, a total of 488 records from intact and fissured clay sites were extracted from this database to train and test the ML models; 75% of the data was used for training and the remaining 25% for testing. This study used vertical total stress (σv0), hydrostatic pore water pressure (u0), excess pore water pressures at one or more locations (at the cone tip (u1), above the cone base (u2), and along the sleeve (u3)), corrected cone resistance (qt), and the type of clay (intact or fissured) as input data.
A preliminary study was performed to assess the influence of the excess pore water pressure measurements on the efficiency of the ML methods. Since previous literature only considered the u1 and u2 elements, this research evaluated the effect of all three elements, i.e., u1, u2, and u3, together with their combinations, on the predicted OCR values. This led to a total of 7 scenarios, involving single-element, double-element, and triple-element CPTU data, as presented in Tab.2. From the main database, 206 data sets that include records of all three pore water pressure measurements were taken into account, with 75% used for training and 25% for testing. The results indicated that using double-element CPTU sounding with the u1 and u2 elements generally improves the performance of the models. Hence, u1 and u2 were chosen as additional input parameters. This choice is also consistent with past studies and provides a larger set of data for developing the ML models. Thus, scenario IV with six inputs was finally used for training and testing the ML methods.
In Tab.3, sample data values are presented, and Tab.4 gives statistical details on the input and output parameters used in this research. It is worth noting that, using the pandas get_dummies utility in Python, the clay-type column was later split into two columns (one-hot encoding) to make the data digestible for the ML models; a sketch of this step is given below.
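As a minimal sketch, the encoding and splitting steps might look as follows; the column names and toy records are hypothetical stand-ins for the actual database, not values from Tab.3.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy records standing in for the 488-record database (hypothetical values).
df = pd.DataFrame({
    "sigma_v0": [100.0, 150.0, 200.0, 250.0],   # vertical total stress, kPa
    "u0": [50.0, 80.0, 110.0, 140.0],           # hydrostatic pore pressure, kPa
    "qt": [800.0, 1500.0, 2400.0, 3100.0],      # corrected cone resistance, kPa
    "clay_type": ["intact", "fissured", "intact", "fissured"],
    "OCR": [1.5, 12.0, 2.0, 18.0],
})
df = pd.get_dummies(df, columns=["clay_type"])  # one-hot: two 0/1 columns
X, y = df.drop(columns="OCR"), df["OCR"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42       # 75% train / 25% test
)
```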
3 Machine learning methods
ML techniques may be divided into three categories: reinforcement, supervised, and unsupervised learning. Only supervised learning was used in this investigation because the target characteristic was known. Supervised learning builds AI models by training a computer algorithm on input data that has been labeled with a known output. This section explains the five supervised ML algorithms used in this research.
3.1 Artificial neural network
ANN has emerged as a significant area of interest within the engineering field, particularly in applications involving extensive data sets [25,26]. An ANN mimics the human nervous system and is used to model nonlinear relationships between input and output data. In practical applications, many neural network models employ backpropagation networks and their variants [27]. Generally, the neurons are built and initialized with random weights and biases; a loss function such as the mean squared error (MSE) then measures the difference between the real values and the predictions, and an optimization function such as stochastic gradient descent (SGD) adjusts the neuron weights and biases to minimize the error/loss. Memory requirements are also modest, as the network processes a single training sample at a time [28]. Generally, an objective function of the following form is to be minimized:

L(λ) = (1/n) Σᵢ₌₁ⁿ Lᵢ(λ)   (1)

The parameter λ should be selected so as to minimize L(λ), where n is the number of training observations and each summand function Lᵢ is usually associated with the ith observation in the training data set. In SGD, the true gradient of L(λ) is approximated by the gradient at a single example:

λ ← λ − η∇Lᵢ(λ)   (2)

where η is the learning rate. The algorithm updates the parameters for each training example as it sweeps across the training set; several epochs are run until the model converges, and the data are shuffled after each epoch to ensure convergence.
A fully connected neural network has three main groups of neurons: the input layer, the hidden layers, and the output layer. The number of neurons in the input and output layers equals the number of input and output features, respectively, whereas the hidden layers need to be designed and optimized carefully to capture the nonlinear relationship between the given inputs and the output.
Usually, to obtain a more efficient ANN model, other types of layers, such as activation functions and batch normalization, are placed between the layers of neurons. Activation functions such as ReLU provide a nonlinear connection from one set of neurons to another, which helps increase the accuracy of the ANN on nonlinear problems. A batch normalization layer normalizes the outputs of neurons before passing them to the next layer, which improves training efficiency by reducing training time and improving accuracy.
To ensure the developed model is close to optimal, various design strategies have been suggested, the most common being grid search. A full grid search means building all possible neural networks and checking their accuracy on the given data, which is very time-consuming. In this study, the AutoKeras package, implemented on top of TensorFlow, was used to design the ANN architecture. AutoKeras searches over model hyper-parameters via a genetic algorithm to suggest a considerably efficient model within a limited number of trials [29].
After trying over 200 models in five separate runs, AutoKeras converged on a model similar to Fig.1; we then improved this model and re-implemented it with the PyTorch library. The proposed model (Fig.1), with 32 neurons in hidden layer 1, 512 neurons in hidden layer 2, and several ReLU and batch normalization layers, showed the best accuracy among our trials and the AutoKeras suggestions.
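A minimal PyTorch sketch of this architecture is given below; the exact layer ordering, the input width (seven columns after one-hot encoding of clay type), and the optimizer settings are assumptions rather than values taken from Fig.1 or Tab.5.

```python
import torch
import torch.nn as nn

class OCRNet(nn.Module):
    """Sketch of the Fig.1 architecture: two hidden layers (32 and 512
    neurons) with batch normalization and ReLU between them."""
    def __init__(self, n_inputs: int = 7):  # 7 columns assumed after one-hot
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 32),   # hidden layer 1: 32 neurons
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Linear(32, 512),        # hidden layer 2: 512 neurons
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 1),         # single regression output: OCR
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = OCRNet()
criterion = nn.MSELoss()                                   # loss as in the text
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # SGD, cf. Eq. (2)
```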
3.2 Support vector machine
SVM is a supervised ML algorithm used for both classification and regression tasks; it finds an optimal hyperplane (a line in 2D) that best separates data points into different classes while maximizing the margin between them [30]. Fig.2 illustrates the SVM method. In the developed SVM model, the kernel determines how the data are transformed into a higher-dimensional space (polynomial in this study), the degree controls the complexity of the polynomial function, and the tolerance (tol) defines the stopping criterion for the optimization process.
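A minimal sketch of such an SVM regressor, using scikit-learn's SVR, is shown below; the degree and tol values are illustrative placeholders, not the tuned values of Tab.5.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svm_model = make_pipeline(
    StandardScaler(),                        # SVMs are sensitive to feature scale
    SVR(kernel="poly", degree=3, tol=1e-3),  # polynomial kernel, as in the text
)
# svm_model.fit(X_train, y_train); y_pred = svm_model.predict(X_test)
```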
3.3 Random forest
To produce high-quality output, ensemble learning integrates the predictions of many algorithms. The RF algorithm is an ensemble approach that uses bootstrap aggregation to create decision trees. As a non-parametric regression technique, it creates prediction rules without any explicit prior assumption about how the predictors and the output are related. The final output of the algorithm is an aggregate of the predictions of all generated decision trees. This approach treats all data dimensions nearly equally and avoids significant correlation between trees [31].
Several key hyper-parameters govern the creation of a tailored RF model. The first significant parameter, which affects the growth of the ensemble forest (high-level control), is the number of estimators (n_estimators), which sets the required number of decision trees (each generated from a random selection of the data). Two hyper-parameters control the algorithm at a lower level by governing decision tree (DT) formation: min_samples_leaf specifies the minimum number of samples required at a leaf node and thus regulates node growth, while max_depth determines the depth of the DT. If max_depth is omitted, nodes are expanded until all leaves are pure or until all leaves contain fewer samples than the specified minimum number of samples required to split a node (min_samples_split). When conducting RF trial-and-error experiments, it is crucial to keep in mind that greatly extended trees with high depth tend to overfit the randomly chosen data on which they were grown. Additionally, a large forest with many estimators may require considerable computing resources with little to no improvement in model accuracy.
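A minimal scikit-learn sketch using the hyper-parameters discussed above might look as follows; the values are illustrative placeholders, not the tuned values of Tab.5.

```python
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(
    n_estimators=500,      # number of decision trees in the forest
    max_depth=10,          # maximum depth of each tree
    min_samples_split=4,   # min samples required to split an internal node
    min_samples_leaf=2,    # min samples required at a leaf node
    random_state=42,
)
# rf_model.fit(X_train, y_train)
```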
3.4 Gradient boosting machine
GBM is a tree-based ML model that can be regarded as an improved version of RF. GBM and RF differ fundamentally in how the tree ensemble is built. In boosting, additional trees are introduced to decrease the error in predicting the target variable; new trees are added to the GBM structure with a constant learning factor until the estimation error is reduced and the model's maximum accuracy is attained [32].
Hyper-parameters are crucial to the creation of GBM models and have a significant impact on model accuracy. In the GBM algorithm, new trees are added while the loss function (squared error) is optimized. In addition to the RF hyper-parameters, GBM provides a learning rate parameter that controls how strongly each tree's contribution is shrunk as new ones are developed. The ideal hyper-parameter values were found by trial and error, a random grid search, and consideration of the relevant range of each hyper-parameter.
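A minimal sketch of such a random search, assuming scikit-learn's GradientBoostingRegressor, an illustrative parameter grid, and the k-fold cross-validation mentioned in the introduction (k = 5 assumed), is given below.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],  # shrinks each tree's contribution
    "max_depth": [3, 5, 8],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(loss="squared_error", random_state=42),
    param_distributions=param_grid,
    n_iter=20,                            # number of random configurations tried
    cv=5,                                 # k-fold cross-validation
    scoring="neg_root_mean_squared_error",
)
# search.fit(X_train, y_train); gbm_model = search.best_estimator_
```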
3.5 Extreme gradient boosting
XGB, short for eXtreme Gradient Boosting [33], is a powerful and efficient ML library designed for scalable and distributed gradient-boosted decision trees (GBDT). It is widely recognized for its high performance and flexibility, making it a top choice for solving regression, classification, and ranking problems. With built-in support for parallel tree boosting, XGB optimizes both speed and accuracy, enabling faster model training and improved predictive performance. Additionally, it offers advanced regularization techniques, robust handling of missing values, and integration with popular ML frameworks, making it a preferred tool for data scientists and ML practitioners. In comparison to GBM, XGB has a subsample parameter that defines the proportion of the training data randomly selected for each boosting round; subsampling occurs at every boosting iteration.
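A minimal sketch of an XGB regressor with the subsample parameter is shown below; the values are illustrative placeholders, not the tuned values of Tab.5.

```python
from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,     # fraction of training data drawn at each boosting round
    random_state=42,
)
# xgb_model.fit(X_train, y_train)
```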
Fig.3 shows the formation strategy and shape for RF, GBM, and XGB models. Also, the selected hyper-parameters for ML models are given in Tab.5.
4 Preprocessing and data investigation
To explore and prepare the data, Python libraries such as matplotlib, pandas, and seaborn were used. After removing the outliers (infrequent records) and records with NaN values, one-hot encoding was performed on the clay-type column to transform its string (categorical) values into a numerical array. A Pearson correlation matrix was then computed to investigate the relationships between OCR and the inputs, as well as among the inputs themselves (multicollinearity). Tab.6 shows the correlation coefficients between the features of the data; the coefficients range from −1 to 1, with −1 representing a strong negative correlation and 1 a strong positive correlation between the variables in the data set. As can be observed, vertical stress and hydrostatic pressure have a low negative correlation with OCR, while, as expected, clay type and cone resistance are strongly correlated with OCR. Moreover, the excess pore pressure at the cone tip has a higher correlation with OCR than that above the cone base.
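A minimal sketch of this correlation analysis is shown below; the toy records are hypothetical stand-ins for the study's database.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy records standing in for the study's database (hypothetical values).
df = pd.DataFrame({
    "qt": [800.0, 1500.0, 2400.0, 3100.0],
    "u1": [300.0, 700.0, 1100.0, 1500.0],
    "OCR": [1.5, 12.0, 2.0, 18.0],
})
corr = df.corr(method="pearson")  # Pearson coefficients in [-1, 1], as in Tab.6
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()
```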
5 Results and discussion
5.1 Prediction of overconsolidation ratio using machine learning methods
Tab.7 summarizes the performance of the five prediction models on both the training and testing data. Mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R2) are used as metrics to evaluate the quality of the methods. SVM, a simpler model that cannot fully capture the complexity of OCR prediction, reached an R2 of 0.83 on the test set, whereas XGB reached a much higher R2 of 0.94 and a low RMSE of 2.50. Following XGB, the RF, GBM, and ANN models also delivered strong performance, all achieving R2 values of 0.90 or above on the test data, indicating high effectiveness in predicting the OCR of clays. Additionally, the difference between the training and testing results is insignificant, indicating that none of the models suffers from overfitting.
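As a minimal sketch, the three metrics can be computed as follows; the arrays passed to the function are hypothetical stand-ins for the oedometer-measured and model-predicted OCR values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true, y_pred):
    """Return the MAE, RMSE, and R2 used in Tab.7."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2

# Toy usage with hypothetical measured vs. predicted OCR values
print(report([1.5, 12.0, 2.0, 18.0], [1.7, 10.5, 2.2, 16.0]))
```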
The graphs in Fig.4 compare the OCR values obtained from the proposed models (predicted) with those obtained from the oedometer test (measured) for both the training and test data sets. In these graphs, the closer the points cluster around the red line, the more accurate the predictions. As can be seen, when the OCR values are below about 10, the models perform accurately, but for clays with higher OCR values, an attribute of fissured clays, the models' accuracy decreases. Two main reasons could explain this. First, there is a lack of available data for fissured clays, so the data set is not large enough for proper training. Second, the inaccuracy stems from drawbacks associated with the sampling (inability of the samples to truly represent the in situ fissured clay) and the testing procedure (the influence of cracks on the shape of the consolidation curve and, accordingly, the measured OCR) of fissured clay [1].
Fig.5 depicts the box-and-whisker plot of the OCR prediction error (the difference between measured and predicted values) for the ML models. The median is represented by the line in the center of each box, while the quartiles are indicated by the box's borders. The whiskers extend to 1.5 times the interquartile range, and points beyond the whiskers are outliers. In general, the errors for all models are centered around zero and evenly distributed, indicating that the models are unbiased. SVM shows a larger range of outliers and has the widest error box on the training set, whereas the XGB model performed very well on the training data with marginal error; XGB also demonstrated reasonably good prediction on the test data set. The interquartile range for the RF and GBM models lies within (−2, +2) on both the test and training data, indicating that they are highly reliable. For ANN, the interquartile range lies within about (−4, +1.5) on the test data set, showing poorer performance than the other models. Furthermore, as seen in this plot, none of the models has been overfitted.
The Taylor diagram, a graphical framework for comparing test data sets with the observed (reference) data set, is shown in Fig.6. In this diagram, each model is represented by a colored circle, while the orange contours represent the centered root mean square values. The location of each circle indicates how well that model corresponds to the measurements; the model that most closely matches the observations lies closest to the x-axis point labeled "Ref". As a basic model, SVM shows a standard deviation far from the reference, indicating that its predictions are not well distributed around the real values. In contrast, XGB shows the best performance, matching the distribution of the real values and achieving a high correlation (> 96%) with them. Moreover, ANN has almost the same standard deviation as the observations, so it simulates the amplitude of variations better than the other models.
5.2 Sensitivity analysis
To investigate the performance of the models in more detail, a sensitivity analysis was performed. This analysis is an efficient tool for assessing the influence of the input features on a model's predictions. The Sobol approach, a variance-based sensitivity analysis, is applied in this study, with the Saltelli sampler used as the sampling technique. The size of the sample space is set by Eq. (3):

Nₛ = N(2D + 2)   (3)

where N is the argument supplied to define the size of the sample data and D is the number of model inputs (six in this study). Thus, 14336 samples were created for the current data set and submitted to the Sobol algorithm using the SALib library [34–36].
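A minimal SALib sketch of this procedure is given below; the variable bounds are hypothetical, and N = 1024 is an assumption consistent with Eq. (3), since 1024 × (2 × 6 + 2) = 14336.

```python
from SALib.sample import saltelli
from SALib.analyze import sobol

# Problem definition for the six model inputs; the bounds below are
# hypothetical placeholders, not the ranges of Tab.4.
problem = {
    "num_vars": 6,
    "names": ["sigma_v0", "u0", "u1", "u2", "qt", "clay_type"],
    "bounds": [[0, 600], [0, 300], [0, 2500], [0, 2000], [0, 5000], [0, 1]],
}
param_values = saltelli.sample(problem, 1024)  # 1024 * (2*6 + 2) = 14336 rows
# Y = trained_model.predict(param_values)      # predictions from a fitted model
# Si = sobol.analyze(problem, Y)               # Si["S1"]: first-order indices (Fig.7)
```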
Fig.7 shows the results of the Sobol analysis in terms of the first-order indices. According to this figure, for the RF, GBM, XGB, and SVM models, the corrected cone resistance is by far the most significant factor influencing the OCR predictions. For ANN, it can be inferred that the predictions are influenced by all features, with the pore pressure above the cone base contributing the least to the results.
5.3 SHAP analysis
SHAP (SHapley Additive exPlanations) is a powerful method for interpreting ML models by assigning an importance value to each feature based on its contribution to a specific prediction. SHAP values are derived from cooperative game theory, ensuring a fair distribution of feature contributions across the possible feature combinations. A positive SHAP value indicates that a feature pushes the model's prediction higher, whereas a negative SHAP value lowers it. The magnitude of the SHAP value represents the strength of the feature's impact on the model output.
Fig.8 shows the SHAP values of the different features used in the XGB model for OCR prediction. Each point represents an individual observation; the x-axis denotes the SHAP value (impact on the model output), and the y-axis lists the input features. The color gradient, ranging from blue (low feature values) to red (high feature values), indicates the magnitude of each feature value. Features such as clay type and qt show a wider spread in SHAP values, suggesting a significant impact on the model's predictions. Positive SHAP values for high feature values (red points) imply that these features strongly increase the prediction, whereas negative SHAP values for high feature values suggest a decreasing effect. The figure effectively demonstrates how the different soil-related parameters influence the XGB model's decision-making process, highlighting the most influential variables. While clay type and qt are the most significant features for determining OCR accurately, σv0 and u0 decrease the prediction in some records and affect the results substantially.
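A minimal sketch of this SHAP analysis is given below; xgb_model and X_train refer to a fitted model and training frame such as those in the earlier sketches.

```python
import shap

# Assumes xgb_model has been fitted on X_train (see the earlier sketches).
explainer = shap.TreeExplainer(xgb_model)    # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)      # beeswarm summary, as in Fig.8
```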
5.4 Comparison with existing methods
As mentioned in the introduction, several relationships are currently available for estimating the OCR of clays. Based on the data used in this study, four well-known relationships [6,11,12] were chosen for comparison with the ML models. Tab.8 presents the results achieved by these four traditional approaches on the testing data set in terms of R2, MAE, and RMSE. Comparing these results with those obtained from the ML models (Tab.7) indicates that the ML models perform significantly better and provide more reliable results.
6 Conclusions
This study utilized an advanced form of five commonly used ML models to predict the OCR values of fissured and intact clays. The following conclusions are drawn from the results.
1) The preliminary study showed that using double-element CPTU soundings with u1 and u2 elements provided better OCR predictions.
2) The models demonstrated accurate performance when the OCR values were below approximately 10.
3) The accuracy of the models decreased when predicting OCR for clays with higher values (> 10), typically indicating the presence of fissures.
4) Sensitivity analysis indicated that cone resistance (qt) played a major role in the predictions of RF, GBM, SVM, and XGB models, whereas ANN predictions were influenced by all parameters almost equally.
5) The comparative analysis of the models revealed that SVM was less effective in capturing the complex patterns associated with OCR prediction. On the other hand, ensemble-based methods, particularly XGB, demonstrated superior accuracy. RF, GBM, and ANN also maintained strong and consistent performance, reinforcing their suitability for reliable OCR estimation in clay soils.
6) A comparison with existing empirical relationships for OCR estimation showed that these traditional approaches performed significantly worse than the proposed ML models.
7) While the developed models demonstrated strong predictive performance, a more detailed analysis of model performance across different geological conditions and clay types (e.g., high-plasticity vs. low-plasticity clays) would provide additional insights into the models’ universality. Future studies with larger data sets could allow for statistically significant comparisons of regional and material-specific variations.