Evaluation of DeepSeek-R1 and its distilled models for performance and cost efficiency in oncology

Xiao Wei , Fangcen Liu , Kai Xin , Lijing Zhu

Eurasian Journal of Medicine and Oncology ›› 2025, Vol. 9 ›› Issue (4): 160-167. DOI: 10.36922/EJMO025150097

ORIGINAL RESEARCH ARTICLE

Abstract

Introduction: Malignant tumors pose a significant public health threat, and integrating artificial intelligence into health care is an increasing priority. Many oncology institutions are already considering the use of DeepSeek-R1 to assist physicians with complex medical decisions. However, evidence on the accuracy, consistency, and cost efficiency of DeepSeek-R1 and its distilled models in oncology decision-making remains insufficient. This study addresses that gap by evaluating the performance and cost-effectiveness of DeepSeek-R1 and its distilled models in oncology, providing critical insights into their potential for clinical integration.

Objectives: This study aimed to systematically evaluate the performance, consistency, and cost-efficiency of the open-source large language model (LLM) DeepSeek-R1 and its distilled variants in the context of oncology decision-making, using a benchmark derived from the MedQA dataset.

Methods: A custom oncology question set of 1,206 multiple-choice questions was curated from MedQA. Seven models, DeepSeek-R1 and six distilled versions, were evaluated using an automated testing framework. Accuracy, consistency, latency, and token consumption were compared across models. Statistical tests, including the McNemar test and the Wilcoxon signed-rank test, were used to assess differences in performance. Questions were also categorized by clinical task type (diagnosis, treatment, triage, and follow-up) for subgroup analysis.
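
The metrics named above could be computed along the following lines. This is a minimal illustrative sketch, not the authors' actual framework: the function names and data layout (per-question answer lists, repeated runs per model) are assumptions for the example.

```python
from statistics import mean

def accuracy(preds, gold):
    # Fraction of questions for which the model's answer matches the key.
    return mean(p == g for p, g in zip(preds, gold))

def consistency(runs):
    # runs: one list of answers per repeated run of the same model.
    # Fraction of questions answered identically across all runs.
    return mean(len(set(answers)) == 1 for answers in zip(*runs))

def mcnemar_statistic(preds_a, preds_b, gold):
    # Discordant pairs: questions where exactly one of the two models is correct.
    b = sum(pa == g and pb != g for pa, pb, g in zip(preds_a, preds_b, gold))
    c = sum(pa != g and pb == g for pa, pb, g in zip(preds_a, preds_b, gold))
    # McNemar chi-square statistic with continuity correction.
    return (abs(b - c) - 1) ** 2 / (b + c) if (b + c) else 0.0
```

A paired test such as McNemar's is appropriate here because both models answer the same 1,206 questions, so only the discordant questions carry information about which model is stronger.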

Results: DeepSeek-R1 achieved the highest performance (accuracy: 91.38%; consistency: 90.47%), whereas DeepSeek-R1-Distill-Qwen-32B was the only distilled model to exceed the 0.8 threshold on both metrics (accuracy: 88.72%; consistency: 81.44%). DeepSeek-R1 was significantly more accurate than its distilled counterpart (p<0.05), particularly on diagnosis- and treatment-related tasks (p<0.05); however, it also exhibited significantly greater latency and token consumption. A Cohen’s kappa of 0.575 indicated moderate agreement between the two models.
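
The Cohen's kappa reported above measures chance-corrected agreement between the two models' per-question correctness. A minimal sketch of that computation, with hypothetical inputs (the function and data names are not from the study):

```python
def cohens_kappa(correct_a, correct_b):
    # correct_a, correct_b: parallel lists of booleans,
    # True where the model answered that question correctly.
    n = len(correct_a)
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Expected agreement if the two models' correctness were independent,
    # given each model's marginal accuracy.
    pa = sum(correct_a) / n
    pb = sum(correct_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)
```

A kappa of 0.575 falls in the conventional "moderate agreement" band (0.41-0.60), meaning the two models often succeed and fail on different questions despite their similar overall accuracy.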

Conclusion: DeepSeek-R1 is more suitable for high-stakes oncology tasks requiring high accuracy and consistency, whereas DeepSeek-R1-Distill-Qwen-32B offers a cost-effective alternative for use in outpatient or resource-limited settings. These findings support a task- and resource-adaptive deployment strategy for LLMs in clinical oncology.

Keywords

DeepSeek-R1 / Distilled models / Oncology / Performance / Cost efficiency

Cite this article

Xiao Wei, Fangcen Liu, Kai Xin, Lijing Zhu. Evaluation of DeepSeek-R1 and its distilled models for performance and cost efficiency in oncology. Eurasian Journal of Medicine and Oncology, 2025, 9(4): 160-167. DOI: 10.36922/EJMO025150097


Funding

None.

Conflict of interest

The authors declare no conflicts of interest.

