MAD-Fact: a multi-agent debate framework for long-form factuality evaluation in LLMs
Yucheng NING, Xixun LIN, Fang FANG, Yanan CAO
Front. Comput. Sci., 2027, Vol. 21, Issue 4: 2104802
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods designed for short texts often fail on long-form content because of complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach that integrates a large-scale long-form dataset, a multi-agent verification mechanism, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset, and develop MAD-Fact, a debate-based multi-agent verification system. We further introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while Chinese domestic models excel on Chinese-language content. Our work provides a structured framework for evaluating and enhancing the factual reliability of long-form LLM outputs, guiding their safe deployment in sensitive domains.
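The abstract's "fact importance hierarchy" and "weighted evaluation metrics" can be made concrete with a small sketch. The paper's exact metric is not given here, so the hierarchy levels, weight values, and claim structure below are illustrative assumptions: each extracted claim carries an importance label and a supported/unsupported verdict (e.g., from the multi-agent debate), and the score is the importance-weighted fraction of supported claims.

```python
from dataclasses import dataclass

# Assumed three-level importance hierarchy; the weights are illustrative, not the paper's.
IMPORTANCE_WEIGHTS = {"core": 3.0, "supporting": 2.0, "peripheral": 1.0}

@dataclass
class Claim:
    text: str
    importance: str   # one of the IMPORTANCE_WEIGHTS keys
    supported: bool   # verdict produced by the verification stage

def weighted_factuality(claims):
    """Importance-weighted fraction of supported claims in a long-form answer."""
    total = sum(IMPORTANCE_WEIGHTS[c.importance] for c in claims)
    if total == 0:
        return 0.0  # no claims extracted: treat as zero rather than divide by zero
    supported = sum(IMPORTANCE_WEIGHTS[c.importance] for c in claims if c.supported)
    return supported / total

# Hypothetical example: one unsupported peripheral claim lowers the score
# less than an unsupported core claim would.
claims = [
    Claim("Insulin is produced in the pancreas.", "core", True),
    Claim("It was first isolated in 1921.", "supporting", True),
    Claim("The team worked in a basement lab.", "peripheral", False),
]
print(round(weighted_factuality(claims), 3))  # 5.0 / 6.0 -> 0.833
```

Under this weighting, an error in a core claim costs three times as much as an error in a peripheral one, which captures the abstract's point that claims in long-form text differ in significance.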
Keywords: information security / large language model / long-form text generation / factuality evaluation / multi-agent system
Higher Education Press