1 Introduction
Data-driven approaches in materials science significantly accelerate material discovery and optimization, reducing costs and enhancing the success rate of material development [1–5]. Symbolic regression is an effective data-driven modeling technique that automatically discovers mathematical expressions representing relationships between variables from data [6–12]. Among the numerous symbolic regression algorithms, the Sure Independence Screening and Sparsifying Operator (SISSO) [13] introduced by Ouyang et al. has attracted considerable attention. It can extract material descriptors from large datasets and is applicable across various fields of study [8, 14, 15]. SISSO generates a large feature space from the original space and then selects features from this new space to build models with a compressed-sensing algorithm.
Consequently, SISSO’s requirement to create and partially store the entire expression space leads to exponentially growing memory demands with increasing features and complexity, making it resource-intensive for complex problems. To address this issue, Ouyang et al. [16] developed VS-SISSO, which combines symbolic regression with iterative variable selection (random search) [17, 18] to optimize the model with numerous input features [19–21]. Alternatively, Xu and Qian [22] proposed i-SISSO [23], which integrates mutual information (MI) and minimum redundancy maximum relevance (mRMR) algorithms [24] to optimize feature combinations for maximum relevance and minimum redundancy. Obviously, the number of input features directly affects the performance of SISSO, VS-SISSO, and i-SISSO, as all input features are considered in the SO process. Therefore, we propose a more straightforward idea: prescreening for the important features as an effective input to SISSO may reduce computational complexity and storage costs, and save time. The decision tree model [25] can describe nonlinear relationships efficiently for datasets of different sizes, which may help us realize this idea. Since a single tree model may exhibit high variability, the Random Forests algorithm [26, 27] should be a better choice for the prescreening of features.
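To illustrate why the expression space becomes resource-intensive, the combinatorial growth can be sketched with a rough back-of-the-envelope count. This is a simplified illustration only: the `feature_space_size` helper and the operator counts are hypothetical, and the estimate ignores SISSO's deduplication and dimensional-analysis rules.

```python
def feature_space_size(n_features, n_unary, n_binary, rungs):
    """Roughly estimate candidate-expression counts per feature-construction
    rung: each rung applies every unary operator to every existing feature
    and every binary operator to every unordered feature pair."""
    n = n_features
    for _ in range(rungs):
        n = n_unary * n + n_binary * n * (n - 1) // 2
    return n

# e.g. 16 primary features, 5 unary and 4 binary operators (hypothetical counts)
print(feature_space_size(16, 5, 4, 1))  # → 560
print(feature_space_size(16, 5, 4, 2))  # → 628880
```

Even under these crude assumptions, a second rung inflates 16 features to hundreds of thousands of candidates, which is why trimming the primary feature set before expression generation pays off so strongly.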
On the other hand, gathering extensive datasets through experiments is expensive and time-intensive, especially in physics, chemistry, materials science, and the life sciences. The lack of abundant data may reduce the performance of machine learning algorithms in these areas, particularly when exploring new phenomena. Random Forests generate multiple tree models from bootstrapped subsets of the data and then vote on feature importance. This resampling effectively enlarges the dataset, which is naturally suited to scientific research with limited data. Combining Random Forests with SISSO is therefore a promising approach for such research.
Herein, we combine Random Forests with SISSO to describe certain nonlinear relationships, resulting in RF-SISSO. Taking SISSO’s verification problem for 299 materials [13, 28] as an example, RF-SISSO and SISSO were compared on training datasets of various sizes. Training sample sizes of 224, 150, 75, and 45 were randomly selected from the dataset of 299 materials, while the remaining samples were used for testing in each case. Across five parallel tests for each training sample size, RF-SISSO maintained a prediction accuracy above 0.9 in all cases, whereas SISSO’s accuracy fell below 0.9 on the 45-sample subsets. As the prescreening by Random Forests effectively reduces the number of input features, RF-SISSO’s descriptor regression efficiency was higher than that of the original SISSO in all cases, notably reducing time costs. Meanwhile, the RF-SISSO algorithm was also compared with the i-SISSO algorithm [22]; RF-SISSO demonstrated higher regression efficiency than i-SISSO with comparable accuracy.
2 Method
2.1 Experimental dataset and environments
SISSO and RF-SISSO were tested on the classification problem of metal/non-metal binary materials using experimental data collected by the authors who developed SISSO [13, 28]. The data sources included the WebElements (atomic) and SpringerMatters (structural) databases.
The features analyzed were:
Pauling electronegativity
Ionization energy
Covalent radius
Electron affinity
Number of valence electrons
Coordination number
Interatomic distance
Atomic composition
Packing fraction
Combining data from WebElements and SpringerMatters resulted in 15 prototypes, covering a total of 299 materials and 16 features. The testing was conducted on hardware configured with an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30 GHz, 128 GB memory, and running on 4 cores.
2.2 Experimental process
2.2.1 Feature evaluation using Random Forests
The features of all materials were evaluated using the Random Forests algorithm. The importance of each feature was determined using the Gini coefficient.
Due to the small dataset, both the sample division and the Random Forests selection were randomized, leading to varying prediction accuracy across cross-validation runs and potential experimental contingencies. To mitigate this, the feature assessment was repeated at least 50 times using the Random Forests algorithm.
The importance scores for each feature were obtained and ranked. Once normalized, these scores allowed a clear comparison of feature importance. Features with a normalized importance score of 5 or higher were selected for the SISSO calculation.
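The prescreening step above can be sketched with scikit-learn’s Gini-based feature importances. This is a minimal illustration on synthetic data, not the actual experimental dataset: the repeat count and the threshold of 5 on a 0–100 normalized scale follow the text, while the data generation and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_features = 299, 16
X = rng.normal(size=(n_samples, n_features))
# synthetic target depending on a few features, so importances are uneven
y = (X[:, 0] + X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)

# repeat the randomized assessment and average the Gini importances
n_repeats = 50
scores = np.zeros(n_features)
for seed in range(n_repeats):
    rf = RandomForestClassifier(n_estimators=50, random_state=seed)
    rf.fit(X, y)
    scores += rf.feature_importances_  # Gini-based importances, sum to 1
scores /= n_repeats

# normalize to a 0–100 scale and keep features scoring 5 or higher
scores_pct = 100 * scores / scores.sum()
selected = sorted(np.where(scores_pct >= 5)[0].tolist())
print(selected)
```

Averaging over many randomized forests damps the run-to-run variability mentioned above before the threshold is applied.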
2.2.2 Operator regression using SISSO
After identifying important features with Random Forests, SISSO was employed to filter key features further. This process yielded two-dimensional descriptors for classification through symbolic regression.
The total sample set was divided into four subsets, with training sample sizes of 224, 150, 75, and 45. Each subset was randomly selected five times. The flowchart of the RF-SISSO algorithm is shown in Fig.1.
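The repeated random subsetting can be sketched as follows (a minimal illustration; the `random_split` helper and the seeding scheme are hypothetical):

```python
import numpy as np

def random_split(n_total, n_train, seed):
    """Randomly pick n_train training indices; the rest are used for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    return idx[:n_train], idx[n_train:]

n_total = 299
for n_train in (224, 150, 75, 45):
    for rep in range(5):  # five parallel random selections per subset size
        train_idx, test_idx = random_split(n_total, n_train, seed=rep)
```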
RF-SISSO selects the top 8 features with importance scores of 5 or higher following the feature ranking by Random Forests. SISSO, without the Random Forests integration, continues to use the original 16 features as input.
The accuracy of the descriptors obtained from the original SISSO and RF-SISSO regressions was evaluated using the Support Vector Classifier (SVC) [29].
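The SVC evaluation can be sketched as follows. This is a minimal illustration with synthetic 2D descriptor values standing in for the regressed descriptors; scikit-learn’s `SVC` is assumed, and the data, labels, and kernel choice here are all hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# hypothetical 2D descriptor values (d1, d2) for train/test materials
X_train = rng.normal(size=(45, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # 1 = metal, 0 = non-metal
X_test = rng.normal(size=(30, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# fit a classifier in the 2D descriptor space and record accuracies
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
```

The classifier’s decision boundary in the (d1, d2) plane is what the classification figures referenced below visualize.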
For detailed principles of the algorithm application, please refer to the supporting information and the original literature [13, 26, 27].
3 Results
3.1 Random Forests algorithm feature screening and performance of SISSO (SISSO & RF-SISSO) in datasets of varying sizes
The study evaluated the importance of features using the Random Forests algorithm. Fig.2(a) shows a histogram of the features affecting metal/nonmetal properties as determined by the Random Forests analysis. The top eight features with importance ratings above 5 were identified: the electronegativities of the “A” and “B” atoms, the packing fraction, the valence electron numbers of the “A” and “B” atoms, the electron affinity energies of the “A” and “B” atoms, and the ionization energy of the “B” atom.
To compare the accuracies of models from SISSO and RF-SISSO, we used the obtained 2D descriptors as training features to classify the metal/nonmetal materials with the SVC classifier and recorded the classification accuracies. For brevity, the accuracy of SISSO/RF-SISSO denotes the accuracy of the 2D descriptors from SISSO/RF-SISSO. As shown in Fig.2(b), the accuracies of both SISSO and RF-SISSO are above 0.9 on the 224-sample subset, where SISSO generally performs slightly better. However, RF-SISSO’s relative accuracy tends to increase as the dataset size decreases. With the 150-sample subset, RF-SISSO outperformed SISSO in two training sets and matched SISSO in one testing set. With the 75-sample subset, RF-SISSO had one training set significantly higher than SISSO and outperformed SISSO in three testing sets. For the 45-sample subset, RF-SISSO consistently outperformed SISSO, with all training accuracies significantly higher and only one testing set lower. In this case, two training sets of SISSO had accuracy below 0.9, which demonstrates RF-SISSO’s advantage in achieving high accuracy with smaller datasets. RF-SISSO’s higher accuracy is mainly attributed to RF’s ability to capture complex nonlinear relationships and interactions between features and target properties. By identifying and removing irrelevant or redundant features in the RF prescreening, the reduced feature space allows SISSO to operate more efficiently and focus on the most informative features. This also helps mitigate the risk of overfitting, as the model is less likely to learn noise from irrelevant features.
The major advantage of RF-SISSO over SISSO is its shorter operator regression time and higher training efficiency. Fig.2(c) and (d) show that the average operator regression time of SISSO is about 26 times longer than that of RF-SISSO for the 224-sample and 150-sample subsets, 39 times longer for the 75-sample subset, and up to 265 times longer for the 45-sample subset. This demonstrates that using RF to eliminate redundant features effectively improves the regression efficiency of SISSO. In addition, we used the Random Forests algorithm alone to classify the data. As shown in Table S1, the testing accuracy of RF alone is not as good as that of SISSO, but the computing time is much shorter. The complementarity between RF and SISSO makes RF-SISSO more efficient at processing high-dimensional datasets and constructing accurate, well-generalized models.
The two-dimensional descriptors obtained by symbolic regression of SISSO and RF-SISSO at different dataset scales are presented in Tab.1. The descriptors were derived from the set with the highest accuracy among the five parallel sets for each sample size. Comparing the descriptors of SISSO and RF-SISSO across different data scales reveals that the complexity of the RF-SISSO descriptors is consistently lower than that of SISSO. Notably, the selected features may vary between experiments due to differences in the training set, even when the number of initial features is not limited in SISSO. The features in the RF-SISSO descriptors are similar to those of SISSO but more physically meaningful, with electronegativity-based combinations recurring. This indicates that stronger electronegativity leads to increased exclusivity rather than sharing of charge, making it easier to form non-metals. Additional descriptors are available in the supporting information (Tables S2−S5).
The results of metal/non-metal classification using the SVC classifier with the 2D descriptors obtained from SISSO/RF-SISSO symbolic regression are shown in Fig.3. Blue dots represent metals, red dots non-metals, and the yellow line the decision boundary. The classification for SISSO training and testing on one 45-sample subset is shown in Fig.3(a) and (b), and the same for RF-SISSO in Fig.3(c) and (d). RF-SISSO’s classification boundary [Fig.3(d) and its inset] is notably clearer than that of SISSO [Fig.3(b) and its inset]. (See the supporting information for visualizations of the other datasets: Figs. S2−S4.)
For comparison, we selected the top 6 and top 10 features as inputs and repeated the experiment. With a 45-sample subset [Fig.4(a)], the testing accuracy is highest with the 8-feature regression. With 10 features, the accuracies of three training sets and one testing set are below 0.9, which indicates that more features are not necessarily better. For the 6-feature case, the accuracies of two training sets and one testing set are below 0.9, showing that fewer features are also not optimal. Additionally, Fig.4(b) shows that the operator regression time is shortest with 8 features, indicating maximum efficiency.
Recently, Chong et al. [30] established interpretable spectral-property relationships with the SISSO algorithm on small datasets. It is valuable to see whether RF-SISSO can improve the regression efficiency and accuracy for these datasets. As shown in Table S6, for small datasets with sample sizes of 20 and 40, RF-SISSO improved the results in most cases. By removing redundant features, the regression efficiency was significantly enhanced (Fig. S5). It is clear that RF-SISSO has an advantage over the original SISSO, particularly in efficiency.
Xu and Qian [22] proposed an improved SISSO (i-SISSO) by integrating the MI and mRMR algorithms to address SISSO’s limitations in high-dimensional model generation. As shown in Fig.5, we also compared i-SISSO and RF-SISSO with the same parameters on seven raw datasets from Xu and Qian [22]. The RMSE values of the different systems [Fig.5(a)] are almost identical for the two methods, while the regression times show a significant difference [Fig.5(b)]. As the regression times of some systems are too short to be clearly observed, the ratios of regression times are presented in the inset of Fig.5(b) for easier comparison. Obviously, RF-SISSO again has the advantage in regression efficiency, while the difference in accuracy is negligible.
4 Conclusion
In conclusion, our study demonstrates the improvements achieved by integrating the Random Forests algorithm with SISSO, resulting in the enhanced RF-SISSO method. This combination enables SISSO to maintain high-precision predictions even with small datasets. RF-SISSO not only improves prediction accuracy across various dataset sizes but also produces more concise 2D descriptors. Furthermore, RF-SISSO enhances regression efficiency, being up to 265 times faster than SISSO for the smallest subset. These findings highlight the robustness and efficiency of RF-SISSO, making it a valuable tool for material descriptor regression across diverse dataset sizes. The complementarity of the Random Forests algorithm and the SISSO algorithm can capture non-linear relationships more effectively, resulting in more accurate predictive models with good generalization ability. This approach also reduces computational complexity and improves operational efficiency.