Farthest point sampling in property designated chemical feature space as an effective strategy for enhancing the machine learning model performance for small scale chemical dataset

Yuze Liu, Lejia Wang, Weigang Zhu, Xi Yu

Journal of Materials Informatics, 2025, Vol. 5, Issue 3: 39

DOI: 10.20517/jmi.2025.10
Research Article

Abstract

Machine learning (ML) model development in chemistry and materials science often grapples with small and imbalanced labeled datasets, a common limitation in experimental studies. Such imbalances can cause overfitting and diminish model generalization. Our study explores the efficacy of the farthest point sampling (FPS) strategy within targeted chemical feature spaces, demonstrating its capacity to generate well-distributed training sets and consequently enhance model performance. We rigorously evaluate this strategy across various ML models, including artificial neural networks, support vector machines, and random forests, using datasets with target physicochemical properties such as standard boiling point and enthalpy of vaporization. Our findings reveal that FPS-based models consistently surpass randomly sampled models, exhibiting superior predictive accuracy and robustness, alongside a marked reduction in overfitting. This improvement is particularly pronounced for smaller training sets and is attributable to increased diversity within the training data’s chemical feature space. Consequently, FPS emerges as an effective and adaptable approach for building high-performance ML models at reduced cost from the limited and biased experimental datasets typical in chemistry and materials science.
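As an illustration of the sampling idea described in the abstract, the following is a minimal sketch of greedy farthest point sampling over a feature matrix. It is not the authors' implementation: the Euclidean metric, the choice of the first point, the function name `farthest_point_sampling`, and the synthetic `descriptors` array are all assumptions made for this example, since the abstract does not specify them.

```python
import numpy as np


def farthest_point_sampling(features, n_samples, start_index=0):
    """Greedy FPS sketch: return row indices of `features` chosen so that
    each new point is the one farthest from the already-selected set.

    Assumptions (not taken from the paper): Euclidean distance and a
    caller-chosen starting point.
    """
    features = np.asarray(features, dtype=float)
    n_points = features.shape[0]
    if not 0 < n_samples <= n_points:
        raise ValueError("n_samples must be between 1 and the number of points")

    selected = [start_index]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(features - features[start_index], axis=1)

    for _ in range(n_samples - 1):
        # Pick the point whose nearest selected neighbour is farthest away.
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        dist_to_new = np.linalg.norm(features - features[next_index], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return np.array(selected)


# Hypothetical stand-in for a molecular descriptor matrix (200 molecules, 5 features).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 5))

# Use the FPS picks as the training set and the remainder as the test set.
train_idx = farthest_point_sampling(descriptors, n_samples=20)
test_idx = np.setdiff1d(np.arange(len(descriptors)), train_idx)
```

In practice the descriptor columns would typically be scaled before computing distances, so that no single feature dominates the metric; that preprocessing step is likewise an assumption here rather than something stated in the abstract.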

Keywords

Materials informatics / machine learning / farthest point sampling / small dataset / chemical database

Cite this article

Yuze Liu, Lejia Wang, Weigang Zhu, Xi Yu. Farthest point sampling in property designated chemical feature space as an effective strategy for enhancing the machine learning model performance for small scale chemical dataset. Journal of Materials Informatics, 2025, 5(3): 39. DOI: 10.20517/jmi.2025.10
