Farthest point sampling in property designated chemical feature space as an effective strategy for enhancing the machine learning model performance for small scale chemical dataset
Yuze Liu , Lejia Wang , Weigang Zhu , Xi Yu
Journal of Materials Informatics, 2025, Vol. 5, Issue 3: 39
Machine learning (ML) model development in chemistry and materials science often grapples with the challenge of small and imbalanced labeled datasets, a common limitation in experimental studies. Such dataset imbalances can precipitate overfitting and diminish model generalization. Our study explores the efficacy of the farthest point sampling (FPS) strategy within targeted chemical feature spaces, demonstrating its capacity to generate well-distributed training sets and consequently enhance model performance. We rigorously evaluate this strategy across various ML models, including artificial neural networks, support vector machines, and random forests, using datasets with target physicochemical properties such as standard boiling points and enthalpy of vaporization. Our findings reveal that FPS-based models consistently surpass randomly sampled models, exhibiting superior predictive accuracy and robustness, alongside a marked reduction in overfitting. This improvement is particularly pronounced for smaller training sets, attributable to the increased diversity within the training data's chemical feature space. Consequently, FPS emerges as an effective and adaptable approach for achieving high-performance ML models at reduced cost from the limited and biased experimental datasets typical in chemistry and materials science.
Materials informatics / machine learning / farthest point sampling / small dataset / chemical database
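The FPS strategy described above can be sketched as a simple greedy algorithm: starting from an arbitrary point, repeatedly select the candidate whose minimum distance to the already-selected set is largest, so the chosen training points spread out across the feature space. The following is a minimal illustrative sketch (not the authors' exact implementation), assuming a Euclidean metric on a precomputed feature matrix:

```python
import numpy as np

def farthest_point_sampling(X, n_samples, seed=0):
    """Greedy farthest point sampling over rows of feature matrix X.

    At each step, pick the point whose distance to its nearest
    already-selected point is largest, yielding a well-spread subset.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # distance of every point to its nearest selected point so far
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(n_samples - 1):
        next_idx = int(np.argmax(min_dist))  # farthest remaining point
        selected.append(next_idx)
        # update nearest-selected distances with the new point
        new_dist = np.linalg.norm(X - X[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return np.array(selected)

# toy example: 200 molecules embedded in a 2-D "chemical feature space"
X = np.random.default_rng(1).normal(size=(200, 2))
train_idx = farthest_point_sampling(X, 20)
```

In practice the feature space would be a property-designated chemical descriptor space (e.g., RDKit descriptors), and the selected indices would define the training set for the downstream ML model; the remaining points serve as the test set.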