Beyond generalist LLMs: building and validating domain-specific models with the SpAMCQA benchmark
Xiaojian Ji, Nianzhe Sun, Anan Wang, Jing Dong, Jiawen Hu, Jian Zhu, Feng Huang, Zhengbo Zhang, Kunpeng Li, Da Teng, Tao Li
Artificial Intelligence Surgery 2026, Vol. 6, Issue 1: 80-97.
Aim: General-purpose large language models (LLMs) exhibit significant limitations in high-stakes clinical domains such as spondyloarthritis (SpA) diagnosis, yet the absence of specialized evaluation tools has prevented these failures from being quantified. This study aims to close that evaluation gap and rigorously test the hypothesis that domain specialization is necessary for achieving expert-level performance in complex medical diagnostics.
Methods: We employed a two-pronged experimental approach. First, we introduced the Spondyloarthritis Multiple-Choice Question Answering Benchmark (SpAMCQA), a comprehensive, expert-validated benchmark engineered to probe the nuanced diagnostic reasoning required for SpA. Second, to validate the domain specialization hypothesis, we developed the Spondyloarthritis Diagnosis Large Language Model (SpAD-LLM) by fine-tuning a foundation model on a curated corpus of SpA-specific clinical data. The efficacy of SpAD-LLM was then evaluated against leading generalist models, including Generative Pre-trained Transformer 4 (GPT-4), on the SpAMCQA testbed.
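The abstract does not reproduce the evaluation harness, but the accuracy figures reported below imply a standard exact-match protocol over multiple-choice items. The Python sketch that follows illustrates such a protocol under stated assumptions: the JSONL item schema, the file name spamcqa_test.jsonl, and the predict_choice stub are illustrative placeholders, not artifacts released with SpAMCQA and not necessarily the pipeline used in the study.

import json
from typing import Callable


def load_items(path: str) -> list[dict]:
    """Load benchmark items, one JSON object per line.

    Assumed (illustrative) schema: each item carries a 'question' string,
    an 'options' mapping keyed by letter, and a gold 'answer' letter.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def accuracy(items: list[dict], predict_choice: Callable[[dict], str]) -> float:
    """Exact-match accuracy: share of items whose predicted letter equals the gold letter."""
    correct = sum(
        predict_choice(item).strip().upper() == item["answer"].strip().upper()
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # predict_choice would wrap the model under test (SpAD-LLM, GPT-4, ...);
    # a trivial stub stands in here so the harness runs end to end.
    items = load_items("spamcqa_test.jsonl")  # hypothetical file name
    print(f"Accuracy: {accuracy(items, lambda item: 'A'):.2%}")

Under an exact-match protocol of this kind, the reported accuracies correspond to the fraction of gold option letters matched exactly; partial credit and free-text grading are not involved.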
Results: On the SpAMCQA benchmark, our specialized SpAD-LLM achieved a state-of-the-art accuracy of 92.36%, decisively outperforming the 86.05% accuracy of the leading generalist model, GPT-4. This result provides the first empirical evidence on a purpose-built benchmark that generalist scaling alone is insufficient for mastering the specific inferential knowledge required for SpA diagnosis.
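For scale, the two accuracies above can be restated as error rates; the arithmetic below is derived solely from the reported figures and is not an additional result of the study.

\[
\mathrm{err}_{\text{GPT-4}} = 1 - 0.8605 = 0.1395, \qquad
\mathrm{err}_{\text{SpAD-LLM}} = 1 - 0.9236 = 0.0764, \qquad
\frac{0.1395 - 0.0764}{0.1395} \approx 0.45
\]

That is, the specialized model makes roughly 45% fewer errors on SpAMCQA than GPT-4.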
Conclusion: Our findings demonstrate that in high-stakes domains, domain specialization is not merely an incremental improvement but a categorical necessity. We release the SpAMCQA benchmark and full inference logs to the public, providing the community with a foundational evaluation toolkit, while positioning the SpAD-LLM series as a validated baseline to catalyze the development of truly expert-level medical artificial intelligence.
Spondyloarthritis / large language model / benchmark / medical dataset / AI-assisted diagnosis / fine-tuning