Beyond generalist LLMs: building and validating domain-specific models with the SpAMCQA benchmark
Xiaojian Ji, Nianzhe Sun, Anan Wang, Jing Dong, Jiawen Hu, Jian Zhu, Feng Huang, Zhengbo Zhang, Kunpeng Li, Da Teng, Tao Li
Artificial Intelligence Surgery 2026, Vol. 6, Issue 1: 80-97.
Aim: General-purpose large language models (LLMs) exhibit significant limitations in high-stakes clinical domains such as spondyloarthritis (SpA) diagnosis, yet the absence of specialized evaluation tools has prevented these failures from being quantified. This study aims to close that evaluation gap and rigorously test the hypothesis that domain specialization is necessary for achieving expert-level performance in complex medical diagnostics.
Methods: We employed a two-pronged experimental approach. First, we introduced the Spondyloarthritis Multiple-Choice Question Answering Benchmark (SpAMCQA), a comprehensive, expert-validated benchmark engineered to probe the nuanced diagnostic reasoning required for SpA. Second, to validate the domain specialization hypothesis, we developed the Spondyloarthritis Diagnosis Large Language Model (SpAD-LLM) by fine-tuning a foundation model on a curated corpus of SpA-specific clinical data. The efficacy of SpAD-LLM was then evaluated against leading generalist models, including Generative Pre-trained Transformer 4 (GPT-4), on the SpAMCQA testbed.
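The abstract does not reproduce the evaluation harness, but the accuracy figures reported below imply a standard exact-match protocol over multiple-choice items. The Python sketch that follows illustrates such a protocol under stated assumptions: the JSONL item schema, the file name spamcqa_test.jsonl, and the predict_choice stub are illustrative placeholders, not artifacts released with SpAMCQA and not necessarily the pipeline used in the study.

import json
from typing import Callable


def load_items(path: str) -> list[dict]:
    """Load benchmark items, one JSON object per line.

    Assumed (illustrative) schema: each item carries a 'question' string,
    an 'options' mapping keyed by letter, and a gold 'answer' letter.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def accuracy(items: list[dict], predict_choice: Callable[[dict], str]) -> float:
    """Exact-match accuracy: share of items whose predicted letter equals the gold letter."""
    correct = sum(
        predict_choice(item).strip().upper() == item["answer"].strip().upper()
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # predict_choice would wrap the model under test (SpAD-LLM, GPT-4, ...);
    # a trivial stub stands in here so the harness runs end to end.
    items = load_items("spamcqa_test.jsonl")  # hypothetical file name
    print(f"Accuracy: {accuracy(items, lambda item: 'A'):.2%}")

Under an exact-match protocol of this kind, the reported accuracies correspond to the fraction of gold option letters matched exactly; partial credit and free-text grading are not involved.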
Results: On the SpAMCQA benchmark, our specialized SpAD-LLM achieved a state-of-the-art accuracy of 92.36%, decisively outperforming the 86.05% accuracy of the leading generalist model, GPT-4. This result provides the first empirical evidence on a purpose-built benchmark that generalist scaling alone is insufficient for mastering the specific inferential knowledge required for SpA diagnosis.
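For scale, the two accuracies above can be restated as error rates; the arithmetic below is derived solely from the reported figures and is not an additional result of the study.

\[
\mathrm{err}_{\text{GPT-4}} = 1 - 0.8605 = 0.1395, \qquad
\mathrm{err}_{\text{SpAD-LLM}} = 1 - 0.9236 = 0.0764, \qquad
\frac{0.1395 - 0.0764}{0.1395} \approx 0.45
\]

That is, the specialized model makes roughly 45% fewer errors on SpAMCQA than GPT-4.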
Conclusion: Our findings demonstrate that in high-stakes domains, domain specialization is not merely an incremental improvement but a categorical necessity. We release the SpAMCQA benchmark and full inference logs to the public, providing the community with a foundational evaluation toolkit, while positioning the SpAD-LLM series as a validated baseline to catalyze the development of truly expert-level medical artificial intelligence.
Spondyloarthritis / large language model / benchmark / medical dataset / AI-assisted diagnosis / fine-tuning