OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
By: Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang
Published: 2026-04-21
Abstract
This paper proposes OGER, a hybrid reinforcement learning framework that integrates offline expert guidance with online exploratory discovery through reward modeling. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that scores online trajectories by their divergence from an ensemble of high-quality offline teacher trajectories. This mechanism incentivizes autonomous exploration and promotes discovery beyond pure imitation, yielding significant performance gains on mathematical and general reasoning benchmarks while maintaining robust generalization to out-of-domain tasks.
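To make the divergence-based auxiliary reward concrete, below is a minimal Python sketch. It is an illustration only: the abstract does not specify the divergence measure, the trajectory representation, or the reward's exact form, so this sketch assumes trajectories are embedded as fixed-size vectors, uses Euclidean distance to the nearest teacher trajectory as the divergence, and maps it to a bounded reward with an exponential. The function name `exploration_reward` and the parameter `beta` are hypothetical, not from the paper.

```python
import numpy as np

def exploration_reward(online_traj, teacher_trajs, beta=1.0):
    """Hypothetical auxiliary reward: benchmark an online trajectory
    against an ensemble of offline teacher trajectories by divergence.

    Assumptions (not specified in the abstract):
      - trajectories are represented as fixed-size embedding vectors;
      - divergence = Euclidean distance to the nearest teacher;
      - reward = exp(-beta * divergence), so staying near high-quality
        teacher behavior yields a reward close to 1, and large
        divergence decays toward 0.
    """
    # Distance from the online trajectory to each teacher trajectory.
    dists = [np.linalg.norm(online_traj - t) for t in teacher_trajs]
    # Benchmark against the closest teacher; map divergence into (0, 1].
    return float(np.exp(-beta * min(dists)))
```

Under these assumptions, a trajectory identical to some teacher receives reward 1.0, while one far from every teacher receives a reward near 0; the auxiliary term would then be combined with the task reward so that exploration is anchored by, but not limited to, the offline demonstrations.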