Towards uncertainty-calibrated structural data enrichment with large language model for few-shot entity resolution

Mengyi YAN, Yaoshu WANG, Xiaohan JIANG, Haoyi ZHOU, Jianxin LI

Front. Comput. Sci., 2025, Vol. 19, Issue 11: 1911376. DOI: 10.1007/s11704-025-41143-4

Artificial Intelligence
RESEARCH ARTICLE


Abstract

Entity Resolution (ER) is vital for data integration and knowledge graph construction. Despite the advances achieved by deep learning (DL) methods built on pre-trained language models (PLMs), these approaches often struggle with unstructured, long-text entities (ULE) in real-world scenarios, where critical information is scattered across the text; they also require extensive human labeling and computational resources. To tackle these issues, we propose a Few-shot Uncertainty-calibrated Structural data Enrichment method for ER (FUSER). FUSER applies unsupervised pairwise enrichment to extract structural attributes from unstructured entities via Large Language Models (LLMs), and integrates an uncertainty-based calibration module that reduces hallucination with minimal additional inference cost. It also implements a lightweight ER pipeline that efficiently performs both blocking and matching with as few as 50 labeled positive samples. Evaluated on six ER benchmark datasets featuring ULE entities, FUSER outperforms state-of-the-art methods and significantly boosts the performance of existing ER approaches through its data enrichment component, while achieving a 10× speedup in LLM uncertainty quantification over baseline methods, demonstrating its efficiency and effectiveness in real-world applications.
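To make the enrichment-plus-calibration idea concrete, the minimal Python sketch below illustrates one plausible reading of the pipeline described above, not FUSER's actual implementation: an LLM is prompted to emit a JSON object of structural attributes for an unstructured entity, and the completion's mean token probability, taken from the same forward pass, gates whether the enrichment is kept. The model name, prompt, attribute schema, and 0.8 threshold are all hypothetical placeholders.

```python
# Illustrative sketch of uncertainty-calibrated enrichment (assumed
# design, not FUSER's released code). Requires the OpenAI Python
# client (pip install openai); model, prompt, schema, and threshold
# are hypothetical placeholders.
import json
import math

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract the brand, model, and capacity of the product described "
    "below. Answer with a single JSON object.\n\nText: {text}"
)

def enrich_with_confidence(text: str, threshold: float = 0.8):
    """Return extracted structural attributes, or None when the model
    is too uncertain (likely hallucination)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        response_format={"type": "json_object"},  # force parseable JSON
        logprobs=True,                            # per-token log-probs
        temperature=0.0,
    )
    choice = resp.choices[0]
    # Geometric-mean token probability: a cheap sequence-level
    # confidence proxy computed from the extraction's own forward
    # pass, so no extra sampling passes are needed.
    logprobs = [tok.logprob for tok in choice.logprobs.content]
    confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))
    if confidence < threshold:
        return None  # discard the enrichment, keep the raw entity text
    return json.loads(choice.message.content)
```

Because the confidence score falls out of the generation pass itself, this style of single-pass scoring is what makes such calibration far cheaper than sampling-based consistency checks that re-query the model multiple times per entity.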


Keywords

entity resolution / large language model / database / uncertainty quantification / entity matching

Cite this article

Mengyi YAN, Yaoshu WANG, Xiaohan JIANG, Haoyi ZHOU, Jianxin LI. Towards uncertainty-calibrated structural data enrichment with large language model for few-shot entity resolution. Front. Comput. Sci., 2025, 19(11): 1911376. DOI: 10.1007/s11704-025-41143-4


