MAD-Fact: a multi-agent debate framework for long-form factuality evaluation in LLMs
Yucheng NING, Xixun LIN, Fang FANG, Yanan CAO
Front. Comput. Sci., 2027, Vol. 21, Issue 4: 2104802
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods designed for short texts often fail on long-form content because of complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach that integrates a large-scale long-form dataset, a multi-agent verification mechanism, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset, and develop MAD-Fact, a debate-based multi-agent verification system. We further introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while Chinese domestic models excel on Chinese-language content. Our work provides a structured framework for evaluating and enhancing the factual reliability of long-form LLM outputs, guiding their safe deployment in sensitive domains.
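The abstract's "fact importance hierarchy" and "weighted evaluation metrics" can be made concrete with a small sketch. The paper's exact metric is not given here, so the hierarchy levels, weight values, and claim structure below are illustrative assumptions: each extracted claim carries an importance label and a supported/unsupported verdict (e.g., from the multi-agent debate), and the score is the importance-weighted fraction of supported claims.

```python
from dataclasses import dataclass

# Assumed three-level importance hierarchy; the weights are illustrative, not the paper's.
IMPORTANCE_WEIGHTS = {"core": 3.0, "supporting": 2.0, "peripheral": 1.0}

@dataclass
class Claim:
    text: str
    importance: str   # one of the IMPORTANCE_WEIGHTS keys
    supported: bool   # verdict produced by the verification stage

def weighted_factuality(claims):
    """Importance-weighted fraction of supported claims in a long-form answer."""
    total = sum(IMPORTANCE_WEIGHTS[c.importance] for c in claims)
    if total == 0:
        return 0.0  # no claims extracted: treat as zero rather than divide by zero
    supported = sum(IMPORTANCE_WEIGHTS[c.importance] for c in claims if c.supported)
    return supported / total

# Hypothetical example: one unsupported peripheral claim lowers the score
# less than an unsupported core claim would.
claims = [
    Claim("Insulin is produced in the pancreas.", "core", True),
    Claim("It was first isolated in 1921.", "supporting", True),
    Claim("The team worked in a basement lab.", "peripheral", False),
]
print(round(weighted_factuality(claims), 3))  # 5.0 / 6.0 -> 0.833
```

Under this weighting, an error in a core claim costs three times as much as an error in a peripheral one, which captures the abstract's point that claims in long-form text differ in significance.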
Keywords: information security / large language model / long-form text generation / factuality evaluation / multi-agent system
Higher Education Press