Evaluating the potential risks of employing large language models in peer review
Lingxuan Zhu, Yancheng Lai, Jiarui Xie, Weiming Mou, Lihaoyun Huang, Chang Qi, Tao Yang, Aimin Jiang, Wenyi Gan, Dongqiang Zeng, Bufu Tang, Mingjia Xiao, Guangdi Chu, Zaoqu Liu, Quan Cheng, Anqi Lin, Peng Luo
Clinical and Translational Discovery, 2025, Vol. 5, Issue 4: e70067
Objective: This study aims to systematically investigate the potential harms of Large Language Models (LLMs) in the peer review process.
Background: LLMs are increasingly used in academic processes, including peer review. While they may help address challenges such as reviewer scarcity and slow review turnaround, concerns about the fairness, transparency, and potential biases of LLM-generated reviews have not been thoroughly investigated.
Methods: Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests, and refutations for 20 original, unmodified cancer biology manuscripts obtained from eLife's new publishing model. Artificial intelligence (AI) detection tools (ZeroGPT and GPTZero) assessed whether the reviews were identifiable as LLM-generated. All LLM-generated outputs were evaluated for reasonableness by two experts on a five-point Likert scale.
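To make this workflow concrete, the sketch below shows how such review text could be produced and set aside for detector screening, assuming the Anthropic Python SDK and the legacy claude-2.0 model; the prompt wording and the helper function are illustrative, since the study's exact prompts and any detector API calls are not reproduced here.

```python
# Minimal sketch of the described workflow: prompt an LLM for a review,
# then keep the output for screening with AI detectors (ZeroGPT/GPTZero).
# Assumes the Anthropic Python SDK (pip install anthropic) and an
# ANTHROPIC_API_KEY in the environment; the prompt text is illustrative.
import anthropic

client = anthropic.Anthropic()

def generate_review(manuscript_text: str, stance: str) -> str:
    """Ask the model for a peer review with a requested stance,
    e.g. 'a balanced review' or 'a review recommending rejection'."""
    response = client.messages.create(
        model="claude-2.0",  # legacy model corresponding to the study
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"You are a reviewer for a cancer biology journal. "
                f"Write {stance} of the following manuscript:\n\n"
                f"{manuscript_text}"
            ),
        }],
    )
    return response.content[0].text

# The generated text would then be submitted to ZeroGPT/GPTZero
# (no detector API is assumed here) and rated by expert reviewers.
```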
Results: LLM-generated reviews were broadly consistent with human reviews but lacked depth, particularly in detailed critique. The model proved highly proficient at generating convincing rejection comments and could create plausible citation requests, including requests for unrelated references. AI detectors struggled to identify LLM-generated reviews, with 82.8% of responses classified as human-written by GPTZero.
Conclusions: LLMs can be readily misused to undermine the peer review process by generating biased, manipulative, and difficult-to-detect content, posing a significant threat to academic integrity. Guidelines and detection tools are needed to ensure LLMs enhance rather than harm the peer review process.
Keywords: academic integrity / artificial intelligence / large language models / peer review
© 2025 The Author(s). Clinical and Translational Discovery published by John Wiley & Sons Australia, Ltd on behalf of Shanghai Institute of Clinical Bioinformatics.