PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models

Qian ZHANG; Panfeng CHEN; Linkun FENG; Shuyu LIU; Jiali LI; Heng ZHAO; Mei CHEN; Hui LI; Yanhao WANG

doi:10.1007/s11704-025-41345-w

Front. Comput. Sci., 2026, Vol. 20, Issue 3: 2003902

Interdisciplinary

LETTER
