Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning
By: Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen
Published: 2026-02-09
Category: cs.AI
Abstract
This paper introduces Jackpot, a framework designed to make reinforcement learning (RL) for large language models (LLMs) more efficient by reducing the distribution mismatch between the rollout model, which generates the training samples, and the evolving policy being trained. As the title indicates, the approach is built on budgeted rejection sampling. The authors present it as a way to reduce the computational cost of RL.
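The abstract does not spell out the algorithm, so the following is background only: a minimal sketch of textbook rejection sampling on a toy discrete distribution, showing how samples drawn from one distribution (a stand-in for the rollout model) can be filtered to follow another (a stand-in for the current policy). It is not Jackpot's budgeted variant. The names `p`, `q`, `acceptance_probs`, and `rejection_sample`, and all numbers, are illustrative assumptions.

```python
import random

def acceptance_probs(p, q):
    """Return (accept, m): accept[i] = p[i] / (m * q[i]) with m = max_i p[i] / q[i].

    Assumes q[i] > 0 wherever p[i] > 0 (here: everywhere).
    """
    m = max(pi / qi for pi, qi in zip(p, q))
    return [pi / (m * qi) for pi, qi in zip(p, q)], m

def rejection_sample(p, q, n_draws, rng):
    """Draw n_draws proposals from q and keep each with its acceptance probability."""
    accept, _ = acceptance_probs(p, q)
    outcomes = list(range(len(q)))
    kept = []
    for _ in range(n_draws):
        x = rng.choices(outcomes, weights=q)[0]  # proposal from the rollout distribution
        if rng.random() < accept[x]:             # accept/reject step
            kept.append(x)
    return kept

# Toy distributions over three outcomes (hypothetical values).
q = [0.5, 0.3, 0.2]  # proposal: stand-in for the rollout model
p = [0.2, 0.3, 0.5]  # target: stand-in for the current policy

accept, m = acceptance_probs(p, q)
kept = rejection_sample(p, q, 20000, random.Random(0))
freqs = [kept.count(i) / len(kept) for i in range(len(p))]
```

In this sketch the accepted samples follow `p`, and the expected fraction of proposals kept is `1 / m`. The larger the mismatch between `q` and `p`, the larger `m` becomes and the more proposals are discarded, which is the cost a sampling budget has to trade off against.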