Evaluating the potential risks of employing large language models in peer review

Lingxuan Zhu, Yancheng Lai, Jiarui Xie, Weiming Mou, Lihaoyun Huang, Chang Qi, Tao Yang, Aimin Jiang, Wenyi Gan, Dongqiang Zeng, Bufu Tang, Mingjia Xiao, Guangdi Chu, Zaoqu Liu, Quan Cheng, Anqi Lin, Peng Luo

Clinical and Translational Discovery ›› 2025, Vol. 5 ›› Issue (4): e70067 ›› DOI: 10.1002/ctd2.70067

RESEARCH ARTICLE
Abstract

Objective: This study aims to systematically investigate the potential harms of Large Language Models (LLMs) in the peer review process.

Background: LLMs are increasingly used in academic processes, including peer review. While they may help alleviate reviewer scarcity and improve review efficiency, concerns about fairness, transparency, and potential biases in LLM-generated reviews have not been thoroughly investigated.

Methods: Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests and refutations for 20 original, unmodified cancer biology manuscripts obtained from eLife's new publishing model. Artificial intelligence (AI) detection tools (ZeroGPT and GPTZero) assessed whether the reviews were identifiable as LLM-generated. All LLM-generated outputs were evaluated for reasonableness by two experts on a five-point Likert scale.

Results: LLM-generated reviews were somewhat consistent with human reviews but lacked depth, especially in detailed critique. The model proved highly proficient at generating convincing rejection comments and could create plausible citation requests, including requests for unrelated references. AI detectors struggled to identify LLM-generated reviews, with 82.8% of responses classified as human-written by GPTZero.

Conclusions: LLMs can be readily misused to undermine the peer review process by generating biased, manipulative, and difficult-to-detect content, posing a significant threat to academic integrity. Guidelines and detection tools are needed to ensure LLMs enhance rather than harm the peer review process.
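The generation workflow described in the Methods can be approximated with a short script. The sketch below is illustrative only and is not the authors' released pipeline: it assumes the anthropic Python SDK and an ANTHROPIC_API_KEY environment variable, and the prompt wording, task labels, and the generate_reviewer_output helper are hypothetical.

```python
# Illustrative sketch (not the authors' actual code): prompting Claude 2.0 to draft
# the four kinds of reviewer output evaluated in the study.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY environment variable;
# prompt wording and helper names are hypothetical.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TASKS = {
    "review": "Write a structured peer-review report covering strengths and weaknesses.",
    "rejection": "Write reviewer comments recommending that this manuscript be rejected.",
    "citation_request": "Write a reviewer comment asking the authors to cite additional references.",
    "refutation": "Write a point-by-point refutation of the manuscript's main claims.",
}

def generate_reviewer_output(manuscript_text: str, task: str) -> str:
    """Return one LLM-generated reviewer output for a single manuscript."""
    response = client.messages.create(
        model="claude-2.0",  # model version reported in the study; availability may vary
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": (
                "You are a peer reviewer for a cancer biology journal.\n"
                f"{TASKS[task]}\n\nManuscript:\n{manuscript_text}"
            ),
        }],
    )
    return response.content[0].text
```

Each generated output could then be submitted to ZeroGPT or GPTZero and the fraction flagged as human-written recorded, mirroring the detection step reported in the Results.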

Keywords

academic integrity / artificial intelligence / large language models / peer review

Cite this article

Lingxuan Zhu, Yancheng Lai, Jiarui Xie, Weiming Mou, Lihaoyun Huang, Chang Qi, Tao Yang, Aimin Jiang, Wenyi Gan, Dongqiang Zeng, Bufu Tang, Mingjia Xiao, Guangdi Chu, Zaoqu Liu, Quan Cheng, Anqi Lin, Peng Luo. Evaluating the potential risks of employing large language models in peer review. Clinical and Translational Discovery, 2025, 5(4): e70067. DOI: 10.1002/ctd2.70067



RIGHTS & PERMISSIONS

© 2025 The Author(s). Clinical and Translational Discovery published by John Wiley & Sons Australia, Ltd on behalf of Shanghai Institute of Clinical Bioinformatics.
