Boosting SISSO performance on small sample datasets by using Random Forests prescreening for complex feature selection

Xiaolin Jiang, Guanqi Liu, Jiaying Xie, Zhenpeng Hu

Front. Phys., 2025, Vol. 20, Issue 1: 014209. DOI: 10.15302/frontphys.2025.014209
RESEARCH ARTICLE


Abstract

In materials science, data-driven methods accelerate material discovery and optimization while reducing costs and improving success rates. Symbolic regression is a key tool for extracting material descriptors from large datasets, with the Sure Independence Screening and Sparsifying Operator (SISSO) method being a prominent example. However, SISSO must store the entire expression space, which imposes heavy memory demands and limits its performance on complex problems. To address this issue, we propose the RF-SISSO algorithm, which combines Random Forests (RF) with SISSO. In this algorithm, Random Forests are used for prescreening, capturing non-linear relationships and improving feature selection, thereby enhancing the quality of the input data and boosting accuracy and efficiency in regression and classification tasks. In a test on SISSO’s original verification problem of 299 materials, RF-SISSO demonstrates robust performance and high accuracy: it maintains a testing accuracy above 0.9 across all four training sample sizes and significantly improves regression efficiency, especially for training subsets with smaller sample sizes. For the training subset with 45 samples, RF-SISSO is 265 times more efficient than the original SISSO. As collecting large datasets is costly and time-consuming in practical experiments, RF-SISSO is expected to benefit scientific research by efficiently delivering high prediction accuracy from limited data.
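For illustration, the prescreening step described above can be sketched as follows. This is a minimal sketch assuming scikit-learn's RandomForestRegressor; the function name, the top_k cutoff, and the hand-off to SISSO's input files are illustrative assumptions, not the authors' exact implementation.

# Minimal sketch of Random Forest prescreening before symbolic regression.
# Assumes scikit-learn; top_k and the downstream SISSO step are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_prescreen(X, y, feature_names, top_k=10, n_estimators=500, random_state=0):
    """Rank primary features by Random Forest importance and keep the top_k.

    X : (n_samples, n_features) array of primary features
    y : (n_samples,) target property
    Returns the reduced feature matrix and the names of the kept features.
    """
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1][:top_k]
    return X[:, order], [feature_names[i] for i in order]

# Usage (illustrative):
# X_sel, names_sel = rf_prescreen(X, y, names, top_k=10)

In such a workflow, only the prescreened primary features would be passed to SISSO's feature-construction stage, so the expression space that SISSO has to build and store grows from far fewer primary features, which is the source of the memory and efficiency gains discussed in the abstract.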

Keywords

Random Forests algorithm / SISSO / symbolic regression algorithm / machine learning / small datasets / prescreening / complex feature selection

Cite this article

Xiaolin Jiang, Guanqi Liu, Jiaying Xie, Zhenpeng Hu. Boosting SISSO performance on small sample datasets by using Random Forests prescreening for complex feature selection. Front. Phys., 2025, 20(1): 014209. https://doi.org/10.15302/frontphys.2025.014209


Declarations

The authors declare no competing interests or conflicts of interest.

Electronic supplementary materials

The online version contains supplementary material available at https://doi.org/10.15302/frontphys.2025.014209.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 21933006 and 21773124), the Fundamental Research Funds for the Central Universities of Nankai University (Nos. 63243091 and 63233001), and the Supercomputing Center of Nankai University (NKSC).

RIGHTS & PERMISSIONS

© 2024 Higher Education Press