1 Introduction
Data-driven approaches in materials science significantly accelerate material discovery and optimization, reducing costs and enhancing the success rate of material development [1–5]. Symbolic regression is an effective data-driven modeling technique that automatically discovers mathematical expressions representing relationships between variables from data [6–12]. Among the numerous symbolic regression algorithms, the Sure Independence Screening and Sparsifying Operator (SISSO) [13] introduced by Ouyang et al. has attracted considerable attention. It can extract material descriptors from large datasets and is applicable across various fields of study [8, 14, 15]. SISSO generates a large feature space from the original space and then selects features from this new space to build models with a compressed-sensing algorithm.
Consequently, SISSO’s requirement to create and partially store the entire expression space leads to exponentially growing memory demands with increasing features and complexity, making it resource-intensive for complex problems. To address this issue, Ouyang et al. [16] developed VS-SISSO, which combines symbolic regression with iterative variable selection (random search) [17, 18] to optimize the model with numerous input features [19–21]. Alternatively, Xu and Qian [22] proposed i-SISSO [23], which integrates mutual information (MI) and minimum redundancy maximum relevance (mRMR) algorithms [24] to optimize feature combinations for maximum relevance and minimum redundancy. Obviously, the number of input features directly affects the performance of SISSO, VS-SISSO, and i-SISSO, as all input features are considered in the SO process. Therefore, we propose a more straightforward idea: prescreening for the important features as an effective input to SISSO may reduce computational complexity and storage costs, and save time. The decision tree model [25] can describe nonlinear relationships efficiently for datasets of different sizes, which may help us realize this idea. Since a single tree model may exhibit high variability, the Random Forests algorithm [26, 27] should be a better choice for the prescreening of features.
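To illustrate why the expression space becomes resource-intensive, the combinatorial growth can be sketched with a rough back-of-the-envelope count. This is a simplified illustration only: the `feature_space_size` helper and the operator counts are hypothetical, and the estimate ignores SISSO's deduplication and dimensional-analysis rules.

```python
def feature_space_size(n_features, n_unary, n_binary, rungs):
    """Roughly estimate candidate-expression counts per feature-construction
    rung: each rung applies every unary operator to every existing feature
    and every binary operator to every unordered feature pair."""
    n = n_features
    for _ in range(rungs):
        n = n_unary * n + n_binary * n * (n - 1) // 2
    return n

# e.g. 16 primary features, 5 unary and 4 binary operators (hypothetical counts)
print(feature_space_size(16, 5, 4, 1))  # → 560
print(feature_space_size(16, 5, 4, 2))  # → 628880
```

Even under these crude assumptions, a second rung inflates 16 features to hundreds of thousands of candidates, which is why trimming the primary feature set before expression generation pays off so strongly.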
On the other hand, gathering extensive datasets through experiments is expensive and time-intensive, especially in physics, chemistry, materials science, and the life sciences. The lack of abundant data may reduce the performance of machine learning algorithms in these areas, particularly when exploring new phenomena. Random Forests generate multiple tree models from bootstrapped subsets of the data and then vote on feature importance. This resampling effectively enlarges the dataset, which is naturally suited to scientific research with limited data. Combining Random Forests with SISSO is therefore a promising approach for such research.
Herein, we combine Random Forests with SISSO to describe certain nonlinear relationships, resulting in RF-SISSO. Taking SISSO’s verification problem for 299 materials [13, 28] as an example, RF-SISSO and SISSO were compared on training datasets of various sizes. Training sample sizes of 224, 150, 75, and 45 were randomly selected from the dataset of 299 materials, while the remaining samples were used for testing in each case. Across five parallel tests for each training sample size, RF-SISSO maintained a prediction accuracy above 0.9 in all cases, whereas SISSO’s accuracy fell below 0.9 on the 45-sample subsets. As the prescreening by Random Forests effectively reduces the number of input features, RF-SISSO’s descriptor regression efficiency was higher than that of the original SISSO in all cases, notably reducing time costs. Meanwhile, the RF-SISSO algorithm was also compared with the i-SISSO algorithm [22]; RF-SISSO demonstrated higher regression efficiency than i-SISSO with comparable accuracy.
2 Method
2.1 Experimental dataset and environments
SISSO and RF-SISSO were tested on the classification problem of metal/non-metal binary materials using experimental data collected by the authors who developed SISSO [13, 28]. The data sources included the WebElements (atomic) and SpringerMatters (structural) databases.
The features analyzed were:
Pauling electronegativity
Ionization energy
Covalent radius
Electron affinity
Number of valence electrons
Coordination number
Interatomic distance
Atomic composition
Packing fraction
Combining data from WebElements and SpringerMatters resulted in 15 prototypes, covering a total of 299 materials and 16 features. The testing was conducted on hardware configured with an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30 GHz, 128 GB memory, and running on 4 cores.
2.2 Experimental process
2.2.1 Feature evaluation using Random Forests
The features of all materials were evaluated using the Random Forests algorithm. The importance of each feature was determined using the Gini coefficient.
Due to the small dataset, both the sample division and the Random Forests selection were randomized, leading to varying prediction accuracy across cross-validation runs and potential experimental contingencies. To mitigate this, the feature assessment was repeated at least 50 times using the Random Forests algorithm.
The importance scores for each feature were obtained and ranked. Once normalized, these scores allowed a clear comparison of feature importance. Features with a normalized importance score of 5 or higher were selected for the SISSO calculation.
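The prescreening step above can be sketched with scikit-learn’s Gini-based feature importances. This is a minimal illustration on synthetic data, not the actual experimental dataset: the repeat count and the threshold of 5 on a 0–100 normalized scale follow the text, while the data generation and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_features = 299, 16
X = rng.normal(size=(n_samples, n_features))
# synthetic target depending on a few features, so importances are uneven
y = (X[:, 0] + X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)

# repeat the randomized assessment and average the Gini importances
n_repeats = 50
scores = np.zeros(n_features)
for seed in range(n_repeats):
    rf = RandomForestClassifier(n_estimators=50, random_state=seed)
    rf.fit(X, y)
    scores += rf.feature_importances_  # Gini-based importances, sum to 1
scores /= n_repeats

# normalize to a 0–100 scale and keep features scoring 5 or higher
scores_pct = 100 * scores / scores.sum()
selected = sorted(np.where(scores_pct >= 5)[0].tolist())
print(selected)
```

Averaging over many randomized forests damps the run-to-run variability mentioned above before the threshold is applied.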
2.2.2 Operator regression using SISSO
After identifying important features with Random Forests, SISSO was employed to filter key features further. This process yielded two-dimensional descriptors for classification through symbolic regression.
The total sample set was divided into four subsets, with training sample sizes of 224, 150, 75, and 45. Each subset was randomly selected five times. The flowchart of the RF-SISSO algorithm is shown in Fig.1.
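The repeated random subsetting can be sketched as follows (a minimal illustration; the `random_split` helper and the seeding scheme are hypothetical):

```python
import numpy as np

def random_split(n_total, n_train, seed):
    """Randomly pick n_train training indices; the rest are used for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    return idx[:n_train], idx[n_train:]

n_total = 299
for n_train in (224, 150, 75, 45):
    for rep in range(5):  # five parallel random selections per subset size
        train_idx, test_idx = random_split(n_total, n_train, seed=rep)
```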
RF-SISSO selects the top 8 features with importance scores of 5 or higher following the feature ranking by Random Forests. SISSO, without the Random Forests integration, continues to use the original 16 features as input.
The accuracy of the descriptors obtained from the original SISSO and RF-SISSO regressions was evaluated using the Support Vector Classifier (SVC) [29].
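The SVC evaluation can be sketched as follows. This is a minimal illustration with synthetic 2D descriptor values standing in for the regressed descriptors; scikit-learn’s `SVC` is assumed, and the data, labels, and kernel choice here are all hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# hypothetical 2D descriptor values (d1, d2) for train/test materials
X_train = rng.normal(size=(45, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # 1 = metal, 0 = non-metal
X_test = rng.normal(size=(30, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# fit a classifier in the 2D descriptor space and record accuracies
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
```

The classifier’s decision boundary in the (d1, d2) plane is what the classification figures referenced below visualize.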
For detailed principles of the algorithm application, please refer to the supporting information and the original literature [13, 26, 27].
3 Results
3.1 Random Forests algorithm feature screening and performance of SISSO (SISSO & RF-SISSO) in datasets of varying sizes
The study evaluated the importance of features using the Random Forests algorithm. Fig.2(a) shows a histogram of the features affecting metal/nonmetal properties as determined by the Random Forests analysis. The top eight features with importance ratings above 5 were identified: the electronegativities of the “A” and “B” atoms, the packing fraction, the valence electron numbers of the “A” and “B” atoms, the electron affinity energies of the “A” and “B” atoms, and the ionization energy of the “B” atom.
To compare the accuracies of models from SISSO and RF-SISSO, we used the obtained 2D descriptors as training features to classify the metal/nonmetal materials with the SVC classifier and recorded the classification accuracies. For brevity, the accuracy of SISSO/RF-SISSO denotes the accuracy of the 2D descriptors from SISSO/RF-SISSO. As shown in Fig.2(b), the accuracies of both SISSO and RF-SISSO are above 0.9 on the 224-sample subset, where SISSO generally performs slightly better. However, RF-SISSO’s relative accuracy tends to increase as the dataset size decreases. With the 150-sample subset, RF-SISSO outperformed SISSO in two training sets and matched SISSO in one testing set. With the 75-sample subset, RF-SISSO had one training set significantly higher than SISSO and outperformed SISSO in three testing sets. For the 45-sample subset, RF-SISSO consistently outperformed SISSO, with all training accuracies significantly higher and only one testing set lower. In this case, two training sets of SISSO had accuracy below 0.9, which demonstrates RF-SISSO’s advantage in achieving high accuracy with smaller datasets. RF-SISSO’s higher accuracy is mainly attributed to RF’s ability to capture complex nonlinear relationships and interactions between features and target properties. By identifying and removing irrelevant or redundant features in the RF prescreening, the reduced feature space allows SISSO to operate more efficiently and focus on the most informative features. This also helps mitigate the risk of overfitting, as the model is less likely to learn noise from irrelevant features.
The major advantage of RF-SISSO over SISSO is its shorter operator regression time and higher training efficiency. Fig.2(c) and (d) show that the average operator regression time of SISSO is about 26 times longer than that of RF-SISSO for the 224-sample and 150-sample subsets, 39 times longer for the 75-sample subset, and up to 265 times longer for the 45-sample subset. This demonstrates that using RF to eliminate redundant features effectively improves the regression efficiency of SISSO. In addition, we used the Random Forests algorithm alone to classify the data. As shown in Table S1, the testing accuracy of RF alone is not as good as that of SISSO, but the computing time is much shorter. The complementarity between RF and SISSO makes RF-SISSO more efficient at processing high-dimensional datasets and constructing accurate, well-generalized models.
The two-dimensional descriptors obtained by symbolic regression of SISSO and RF-SISSO at different dataset scales are presented in Tab.1. The descriptors were derived from the set with the highest accuracy among the five parallel sets for each sample size. Comparing the descriptors of SISSO and RF-SISSO across different data scales reveals that the complexity of the RF-SISSO descriptors is consistently lower than that of SISSO. Notably, the selected features may vary between experiments due to differences in the training set, even when the number of initial features is not limited in SISSO. The features in the RF-SISSO descriptors are similar to those of SISSO but more physically meaningful, with electronegativity-based combinations recurring. This indicates that stronger electronegativity leads to increased exclusivity rather than sharing of charge, making it easier to form non-metals. Additional descriptors are available in the supporting information (Tables S2−S5).
The results of metal/non-metal classification using the SVC classifier with the 2D descriptors obtained from SISSO/RF-SISSO symbolic regression are shown in Fig.3. Blue dots represent metals, red dots non-metals, and the yellow line the decision boundary. The classification for SISSO training and testing on one 45-sample subset is shown in Fig.3(a) and (b), and the same for RF-SISSO in Fig.3(c) and (d). RF-SISSO’s classification boundary [Fig.3(d) and its inset] is notably clearer than that of SISSO [Fig.3(b) and its inset]. (See the supporting information for visualizations of the other datasets: Figs. S2−S4.)
For comparison, we selected the top 6 and top 10 features as inputs and repeated the experiment. With a 45-sample subset [Fig.4(a)], the testing accuracy is highest with the 8-feature regression. With 10 features, the accuracies of three training sets and one testing set are below 0.9, which indicates that more features are not necessarily better. For the 6-feature case, the accuracies of two training sets and one testing set are below 0.9, showing that fewer features are also not optimal. Additionally, Fig.4(b) shows that the operator regression time is shortest with 8 features, indicating maximum efficiency.
Recently, Chong et al. [30] established interpretable spectral-property relationships with the SISSO algorithm on small datasets. It is valuable to see whether RF-SISSO can improve the regression efficiency and accuracy for these datasets. As shown in Table S6, for small datasets with sample sizes of 20 and 40, RF-SISSO improved the results in most cases. By removing redundant features, the regression efficiency was significantly enhanced (Fig. S5). It is clear that RF-SISSO has an advantage over the original SISSO, particularly in efficiency.
Xu and Qian [22] proposed an improved SISSO (i-SISSO) by integrating the MI and mRMR algorithms to address SISSO’s limitations in high-dimensional model generation. As shown in Fig.5, we also compared i-SISSO and RF-SISSO with the same parameters on seven raw datasets from Xu and Qian [22]. The RMSE values of the different systems [Fig.5(a)] are almost identical for the two methods, while the regression times show a significant difference [Fig.5(b)]. As the regression times of some systems are too short to be clearly observed, the ratios of regression times are presented in the inset of Fig.5(b) for easier comparison. Obviously, RF-SISSO again has the advantage in regression efficiency, while the difference in accuracy is negligible.
4 Conclusion
In conclusion, our study demonstrates the improvements achieved by integrating the Random Forests algorithm with SISSO, resulting in the enhanced RF-SISSO method. This combination enables SISSO to maintain high-precision predictions even with small datasets. RF-SISSO not only improves prediction accuracy across various dataset sizes but also produces more concise 2D descriptors. Furthermore, RF-SISSO enhances regression efficiency, being up to 265 times faster than SISSO for the smallest subset. These findings highlight the robustness and efficiency of RF-SISSO, making it a valuable tool for material descriptor regression across diverse dataset sizes. The complementarity of the Random Forests algorithm and the SISSO algorithm can capture non-linear relationships more effectively, resulting in more accurate predictive models with good generalization ability. This approach also reduces computational complexity and improves operational efficiency.