Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

By: Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen

Published: 2026-02-09

#cs.AI

Abstract

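This paper introduces Jackpot, a framework intended to make reinforcement learning (RL) for large language models (LLMs) more efficient by reducing the distribution mismatch between the rollout model, which generates the training samples, and the evolving policy being trained. As the title indicates, the approach is based on budgeted rejection sampling and targets settings where the actor and the policy differ substantially. The summary presents it as having significant potential to lower the computational cost of RL.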
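To make the notion of distribution mismatch concrete, here is a minimal sketch of classical (textbook) rejection sampling over a small discrete vocabulary. It is not the Jackpot algorithm, whose budgeted variant is not described in this summary. The names `rejection_sample`, `p_target` (standing in for the policy), and `q_proposal` (standing in for the rollout model) are hypothetical, and the sketch assumes `q_proposal` is positive wherever `p_target` is.

```python
import random


def rejection_sample(p_target, q_proposal, rng, max_tries=10_000):
    """Draw one index distributed as p_target using draws from q_proposal.

    Textbook rejection sampling, shown for illustration only.
    Assumes q_proposal[i] > 0 wherever p_target[i] > 0.
    Returns (index, number_of_proposals_used).
    """
    # Envelope constant M = max_i p_i / q_i, so p_i <= M * q_i for every i.
    m = max(p / q for p, q in zip(p_target, q_proposal) if p > 0)
    indices = range(len(q_proposal))
    for tries in range(1, max_tries + 1):
        # Propose from the rollout-like distribution.
        i = rng.choices(indices, weights=q_proposal, k=1)[0]
        # Accept with probability p_i / (M * q_i), which is at most 1.
        if rng.random() < p_target[i] / (m * q_proposal[i]):
            return i, tries
    raise RuntimeError("no proposal accepted within max_tries")


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.7, 0.2, 0.1]  # hypothetical policy distribution
    q = [0.2, 0.3, 0.5]  # hypothetical rollout distribution
    draws = [rejection_sample(p, q, rng) for _ in range(20_000)]
    counts = [0, 0, 0]
    for index, _ in draws:
        counts[index] += 1
    print("empirical frequencies:", [c / len(draws) for c in counts])
    print("mean proposals per accepted sample:",
          sum(t for _, t in draws) / len(draws))
```

In plain rejection sampling the acceptance probability is 1/M, so the expected number of proposals per accepted sample grows with the largest ratio between the two distributions. That is one way to see why a large mismatch is costly when no budget is imposed.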
