Abstract
Large language models (LLMs) excel at extracting information from the literature. However, deploying LLMs requires substantial computational resources, and the security concerns associated with online LLMs limit their wider application. Herein, we introduce a method for extracting scientific data from unstructured text using a local LLM, demonstrating its application to the scientific literature on on-surface reactions. By combining prompt engineering with multi-step text preprocessing, we show that the local LLM can effectively extract scientific information, achieving 91% recall and 70% precision. Moreover, despite a large difference in parameter count, the local LLM performs comparably to GPT-3.5 Turbo (81% recall, 84% precision) and GPT-4o (85% recall, 87% precision). The simplicity, versatility, reduced computational requirements, and enhanced privacy of the local LLM make it highly promising for data mining, with the potential to accelerate the application and development of LLMs across various fields.
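As a point of reference for the figures quoted above, the sketch below shows how recall and precision are typically computed for an extraction task by comparing model output against a hand-labeled ground truth. The record format (precursor, substrate, reaction) and the example entries are illustrative assumptions, not the authors' evaluation pipeline.

```python
# Minimal sketch (not the authors' pipeline): recall and precision for a
# literature-extraction task, treating each extracted record as a tuple.
# Record format and entries are illustrative assumptions.

ground_truth = {
    ("DBBA", "Au(111)", "Ullmann coupling"),
    ("polyanthrylene", "Au(111)", "cyclodehydrogenation"),
}
extracted = {
    ("DBBA", "Au(111)", "Ullmann coupling"),   # correct extraction
    ("DBBA", "Cu(111)", "Ullmann coupling"),   # spurious extraction
}

true_positives = extracted & ground_truth
recall = len(true_positives) / len(ground_truth)  # found / all relevant
precision = len(true_positives) / len(extracted)  # found / all reported

print(f"recall = {recall:.0%}, precision = {precision:.0%}")
# -> recall = 50%, precision = 50%
```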
Keywords
data mining / large language models / on-surface synthesis / prompt engineering
Cite this article
Juan Xiang, Yizhang Li, Xinyi Zhang, Yu He, Qiang Sun.
Local large language model-assisted literature mining for on-surface reactions.
Materials Genome Engineering Advances, 2025, 3(1): e88. DOI: 10.1002/mgea.88
RIGHTS & PERMISSIONS
2025 The Author(s). Materials Genome Engineering Advances published by Wiley-VCH GmbH on behalf of University of Science and Technology Beijing.