OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
By: Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang
Published: 2026-04-21
Abstract
This paper proposes OGER, a hybrid reinforcement learning framework that integrates offline expert guidance with online exploratory discovery through reward modeling. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that scores online trajectories by their divergence from an ensemble of high-quality offline teacher trajectories. This mechanism incentivizes autonomous exploration and promotes discovery beyond pure imitation, yielding significant performance gains on mathematical and general reasoning benchmarks while maintaining robust generalization to out-of-domain tasks.
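To make the divergence-based auxiliary reward concrete, below is a minimal Python sketch. It is an illustration only: the abstract does not specify the divergence measure, the trajectory representation, or the reward's exact form, so this sketch assumes trajectories are embedded as fixed-size vectors, uses Euclidean distance to the nearest teacher trajectory as the divergence, and maps it to a bounded reward with an exponential. The function name `exploration_reward` and the parameter `beta` are hypothetical, not from the paper.

```python
import numpy as np

def exploration_reward(online_traj, teacher_trajs, beta=1.0):
    """Hypothetical auxiliary reward: benchmark an online trajectory
    against an ensemble of offline teacher trajectories by divergence.

    Assumptions (not specified in the abstract):
      - trajectories are represented as fixed-size embedding vectors;
      - divergence = Euclidean distance to the nearest teacher;
      - reward = exp(-beta * divergence), so staying near high-quality
        teacher behavior yields a reward close to 1, and large
        divergence decays toward 0.
    """
    # Distance from the online trajectory to each teacher trajectory.
    dists = [np.linalg.norm(online_traj - t) for t in teacher_trajs]
    # Benchmark against the closest teacher; map divergence into (0, 1].
    return float(np.exp(-beta * min(dists)))
```

Under these assumptions, a trajectory identical to some teacher receives reward 1.0, while one far from every teacher receives a reward near 0; the auxiliary term would then be combined with the task reward so that exploration is anchored by, but not limited to, the offline demonstrations.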