Development and Performance of a Large Language Model for the Quality Evaluation of Multi-Language Medical Imaging Guidelines and Consensus

Zhixiang Wang , Jing Sun , Hui Liu , Xufei Luo , Jia Li , Wenjun He , Zhenhua Yang , Han Lv , Yaolong Chen , Zhenchang Wang

Journal of Evidence-Based Medicine ›› 2025, Vol. 18 ›› Issue (2) : e70020

DOI: 10.1111/jebm.70020

Abstract

Aim: This study aimed to develop and evaluate an automated large language model (LLM)-based system for assessing the quality of medical imaging guidelines and consensus (GACS) in different languages, with a focus on improving evaluation efficiency and consistency and on reducing manual workload.

Method: We developed the QPC-HASE-GuidelineEval algorithm, which integrates a Four-Quadrant Questions Classification Strategy and Hybrid Search Enhancement. The model was validated on 45 medical imaging guidelines (36 in Chinese and 9 in English) published in 2021 and 2022. Key evaluation metrics included consistency with expert assessments, hybrid search paragraph matching accuracy, information completeness, comparisons of different paragraph matching approaches, and cost-time efficiency.

Results: The algorithm achieved an average accuracy of 77%, performing well on simpler tasks but reaching only 29%–40% accuracy on complex evaluations, such as explanations and visual aids. The average accuracy rates for the English and Chinese versions of the GACS were 74% and 76%, respectively (p = 0.37). Hybrid search achieved the highest scores for paragraph matching accuracy (4.42) and information completeness (4.42), significantly outperforming keyword-based search (1.05/1.05) and sparse-dense retrieval (4.26/3.63). The algorithm reduced evaluation time to 8 min 30 s per guideline and costs to approximately 0.5 USD per guideline, a considerable advantage over traditional manual methods.

Conclusion: The QPC-HASE-GuidelineEval algorithm, powered by LLMs, showed strong potential for improving the efficiency, scalability, and multi-language capability of guideline evaluations, though further enhancements are needed to handle more complex tasks that require deeper interpretation.
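The abstract does not describe how the hybrid search combines its signals. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below ranks guideline paragraphs for an evaluation question by blending a keyword-overlap score with a cosine similarity over term counts. The function names, the `alpha` weight, and the use of term counts as a stand-in for dense embeddings are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query, paragraph):
    """Fraction of distinct query terms that appear in the paragraph."""
    q_terms = set(tokenize(query))
    p_terms = set(tokenize(paragraph))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def cosine_score(query, paragraph):
    """Cosine similarity of term-count vectors.

    Assumption: this stands in for a dense embedding similarity,
    which a real system would obtain from an embedding model.
    """
    q, p = Counter(tokenize(query)), Counter(tokenize(paragraph))
    dot = sum(q[t] * p[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


def hybrid_rank(query, paragraphs, alpha=0.5):
    """Return paragraph indices sorted by a weighted blend of both scores.

    alpha is a hypothetical weight: 1.0 means keyword only,
    0.0 means similarity only. Ties keep the original paragraph order.
    """
    scored = [
        (alpha * keyword_score(query, p) + (1 - alpha) * cosine_score(query, p), i)
        for i, p in enumerate(paragraphs)
    ]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

In a pipeline of this kind, the top-ranked paragraphs would then be passed to the LLM together with the evaluation question; how the actual system selects and weights paragraphs is not stated in the abstract.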

Keywords

consensus / guideline transparency / guideline / large language model / medical imaging / quality assessment

Cite this article

Zhixiang Wang, Jing Sun, Hui Liu, Xufei Luo, Jia Li, Wenjun He, Zhenhua Yang, Han Lv, Yaolong Chen, Zhenchang Wang. Development and Performance of a Large Language Model for the Quality Evaluation of Multi-Language Medical Imaging Guidelines and Consensus. Journal of Evidence-Based Medicine, 2025, 18(2): e70020. DOI: 10.1111/jebm.70020

RIGHTS & PERMISSIONS

2025 Chinese Cochrane Center, West China Hospital of Sichuan University and John Wiley & Sons Australia, Ltd.
