Reinforcement Learning via Self-Distillation

By: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause

Published: 2026-01-28

#cs.AI

Abstract

This research introduces Self-Distillation Policy Optimization (SDPO), an on-policy reinforcement learning algorithm designed to improve Large Language Model (LLM) performance by exploiting rich, tokenized environment feedback. Unlike methods that rely on sparse scalar rewards, SDPO converts this detailed feedback into a dense learning signal, enabling LLMs to learn from their own explained mistakes without an external teacher model. The approach substantially improves sample efficiency and accuracy on complex reasoning and coding tasks, and accelerates discovery on difficult binary-reward tasks with fewer attempts, pointing toward more efficient LLM training.
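A minimal sketch of how such a self-distillation signal could be formed, under the assumption that the policy conditioned on its own attempt plus the textual environment feedback acts as the "teacher," and the policy conditioned on the prompt alone acts as the "student." The exact loss, conditioning scheme, and names (`policy`, `prompt_ids`, `attempt_ids`, `feedback_ids`) are illustrative assumptions, not taken from the paper.

```python
# Hypothetical self-distillation objective in the spirit of SDPO.
# Assumptions: `policy` is a causal LM returning logits of shape [B, L, V];
# the teacher is the same model conditioned on the environment's feedback.
import torch
import torch.nn.functional as F


def self_distillation_loss(policy, prompt_ids, feedback_ids, attempt_ids):
    """Distill the feedback-conditioned predictions (teacher) into the
    prompt-only predictions (student) over the attempt tokens."""
    T = attempt_ids.size(-1)

    # Teacher pass: the model sees prompt + feedback before the attempt,
    # so its next-token distribution reflects the explained mistake.
    teacher_input = torch.cat([prompt_ids, feedback_ids, attempt_ids], dim=-1)
    with torch.no_grad():
        # Logits at position i predict token i+1, so take the positions
        # immediately preceding each attempt token.
        teacher_logits = policy(teacher_input)[:, -T - 1:-1, :]

    # Student pass: the policy conditioned only on the prompt, as at inference.
    student_input = torch.cat([prompt_ids, attempt_ids], dim=-1)
    student_logits = policy(student_input)[:, -T - 1:-1, :]

    # Token-level KL divergence yields a dense per-token signal,
    # unlike a single scalar reward at the end of the episode.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```

In this reading, the dense signal comes from matching the student's distribution to the teacher's at every attempt token, rather than from a single reward at the sequence end; the teacher is the policy itself, so no external model is required.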

