Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

Zhen-Yu ZHANG; Yuting TANG; Jiandong ZHANG; Lanjihong MA; Masashi SUGIYAMA

doi:10.1007/s11704-026-60860-y

Front. Comput. Sci. DOI: 10.1007/s11704-026-60860-y

RESEARCH ARTICLE

Abstract

Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration: the algorithm must lead the LLM to generate informative comparisons so that online RLHF becomes more sample-efficient. Existing exploration strategies often derive bonuses from on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that uses historical data to construct an additional uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task itself and can be tighter than worst-case bounds in practice. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.

Keywords

Reinforcement Learning from Human Feedback / Preference Learning / Online Learning

Cite this article

Zhen-Yu ZHANG, Yuting TANG, Jiandong ZHANG, Lanjihong MA, Masashi SUGIYAMA. Data-dependent Exploration for Online Reinforcement Learning from Human Feedback. Front. Comput. Sci. DOI:10.1007/s11704-026-60860-y

RIGHTS & PERMISSIONS

Higher Education Press 2026

Just Accepted