OpenRedRL: A Lightweight Benchmark for Reinforcement Learning-Based Red Teaming
Xiang ZHENG, Xingjun MA, Wei-Bin LEE, Cong WANG
Red teaming has proven effective for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Among existing red teaming techniques, Reinforcement Learning (RL) has emerged as a promising strategy. However, the lack of a unified benchmark hinders current RL-based red teaming methods: implementation details, especially in Proximal Policy Optimization (PPO)-based RL, significantly affect the stability and reproducibility of results. To address this issue, we introduce OpenRedRL, a lightweight benchmark that simplifies and standardizes the implementation and evaluation of RL-based red teaming. OpenRedRL combines the design strengths of the single-file CleanRL and the highly modular Tianshou, offering high-quality single-file red teaming implementations alongside modular PPO core components, such as the Generalized Advantage Estimator (GAE). It supports a variety of token- and sentence-level diversity metrics, with modular intrinsic reward computation that enables plug-and-play experimentation. To clarify the influence of key components on RL performance, we conduct an extensive ablation study covering Low-Rank Adaptation (LoRA), Kullback-Leibler (KL) divergence, and the Lagrange multiplier. We hope this work contributes to 1) a comprehensive understanding of the implementation nuances of RL-based red teaming algorithms, and 2) rapid prototyping of innovative features for RL-based red teaming. Code for the benchmark is publicly available at https://github.com/x-zheng16/OpenRedRL.
reinforcement learning / red teaming / benchmark / intrinsic motivation / diversity / large language models
Higher Education Press 2026
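
To make the modular design concrete, the following sketch illustrates two of the components named above: a standalone Generalized Advantage Estimator and a simple token-level diversity metric (distinct-n) used as an intrinsic reward bonus. This is a minimal sketch in plain NumPy; all names, signatures, and constants here (compute_gae, distinct_n, the 0.1 bonus weight) are illustrative assumptions, not OpenRedRL's actual API.

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards and dones have length T; values has length T + 1 (a bootstrap
    value is appended). Returns advantages and the value-function targets.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]  # zero out bootstrapping at episode ends
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]
    return advantages, returns

def distinct_n(token_lists, n=2):
    """Distinct-n: ratio of unique n-grams to total n-grams in a batch.

    A common token-level diversity metric; returned as a batch-level
    scalar that can be mixed into the extrinsic reward before GAE.
    """
    ngrams = [tuple(toks[i:i + n])
              for toks in token_lists
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Toy usage: add a weighted diversity bonus to the reward, then run GAE.
rewards = np.array([0.1, 0.0, 0.2, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.7, 0.0])  # length T + 1 with bootstrap
dones = np.array([0.0, 0.0, 0.0, 1.0])
bonus = distinct_n([[1, 2, 3], [2, 3, 4]], n=2)  # toy token ids
adv, ret = compute_gae(rewards + 0.1 * bonus, values, dones)

Because the intrinsic bonus is computed by a separate function and simply added to the extrinsic reward before advantage estimation, swapping in a different diversity metric requires no change to the PPO core, which is the kind of plug-and-play composition described in the abstract.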