Boosting SISSO performance on small sample datasets by using Random Forests prescreening for complex feature selection
Xiaolin Jiang, Guanqi Liu, Jiaying Xie, Zhenpeng Hu
In materials science, data-driven methods accelerate materials discovery and optimization while reducing costs and improving success rates. Symbolic regression is a key tool for extracting material descriptors from large datasets, in particular the Sure Independence Screening and Sparsifying Operator (SISSO) method. However, SISSO must store the entire expression space, which imposes heavy memory demands and limits its performance on complex problems. To address this issue, we propose the RF-SISSO algorithm, which combines Random Forests (RF) with SISSO. In this algorithm, Random Forests are used for prescreening: they capture non-linear input-output relationships and improve feature selection, thereby enhancing the quality of the input data and boosting accuracy and efficiency on both regression and classification tasks. In tests on SISSO's verification problem for 299 materials, RF-SISSO demonstrates robust performance and high accuracy, maintaining a testing accuracy above 0.9 across all four training sample sizes and significantly improving regression efficiency, especially on training subsets with smaller sample sizes. For the training subset with 45 samples, RF-SISSO was 265 times more efficient than the original SISSO. As collecting large datasets is costly and time-consuming in practical experiments, RF-SISSO may benefit scientific research by efficiently delivering high prediction accuracy from limited data.
Random Forests algorithm / SISSO / symbolic regression algorithm / machine learning / small datasets / prescreening / complex feature selection
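The prescreening stage described in the abstract can be sketched as follows. This is a minimal illustration, assuming scikit-learn's RandomForestRegressor as the RF implementation and a simple importance-ranking rule for keeping the top features; the function name `rf_prescreen`, the cutoff parameter `n_keep`, and the toy data are illustrative assumptions, not the authors' actual RF-SISSO code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_prescreen(X, y, n_keep, n_trees=200, seed=0):
    """Rank candidate features by Random-Forest importance and keep the top n_keep.

    In an RF-SISSO-style workflow, only the retained feature columns would be
    passed on to SISSO, shrinking the expression space it must enumerate and store.
    """
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # most important first
    keep = np.sort(order[:n_keep])                     # keep original column order
    return keep, X[:, keep]

# Toy demonstration: the target depends non-linearly on features 0 and 1 only,
# so RF importance should single them out among 8 candidates.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.01 * rng.normal(size=120)
keep, X_sel = rf_prescreen(X, y, n_keep=2)
```

Because tree ensembles split on thresholds rather than fitting a linear form, this ranking also detects non-monotonic relationships (such as the quadratic term above) that linear screening would miss, which is the motivation for using RF rather than correlation-based filters.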