Reinforcement Learning via Self-Distillation

By: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause

Published: 2026-01-28

#cs.AI

Abstract

This research introduces Self-Distillation Policy Optimization (SDPO), an on-policy reinforcement learning algorithm designed to improve Large Language Model (LLM) performance by exploiting rich, tokenized environment feedback. Unlike methods that rely on sparse scalar rewards, SDPO converts this detailed feedback into a dense learning signal, enabling LLMs to learn from their own explained mistakes without an external teacher model. The approach substantially improves sample efficiency and accuracy on complex reasoning and coding tasks, and accelerates discovery on difficult binary-reward tasks with fewer attempts, pointing toward more efficient LLM training.
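A minimal sketch of how such a self-distillation signal could be formed, under the assumption that the policy conditioned on its own attempt plus the textual environment feedback acts as the "teacher," and the policy conditioned on the prompt alone acts as the "student." The exact loss, conditioning scheme, and names (`policy`, `prompt_ids`, `attempt_ids`, `feedback_ids`) are illustrative assumptions, not taken from the paper.

```python
# Hypothetical self-distillation objective in the spirit of SDPO.
# Assumptions: `policy` is a causal LM returning logits of shape [B, L, V];
# the teacher is the same model conditioned on the environment's feedback.
import torch
import torch.nn.functional as F


def self_distillation_loss(policy, prompt_ids, feedback_ids, attempt_ids):
    """Distill the feedback-conditioned predictions (teacher) into the
    prompt-only predictions (student) over the attempt tokens."""
    T = attempt_ids.size(-1)

    # Teacher pass: the model sees prompt + feedback before the attempt,
    # so its next-token distribution reflects the explained mistake.
    teacher_input = torch.cat([prompt_ids, feedback_ids, attempt_ids], dim=-1)
    with torch.no_grad():
        # Logits at position i predict token i+1, so take the positions
        # immediately preceding each attempt token.
        teacher_logits = policy(teacher_input)[:, -T - 1:-1, :]

    # Student pass: the policy conditioned only on the prompt, as at inference.
    student_input = torch.cat([prompt_ids, attempt_ids], dim=-1)
    student_logits = policy(student_input)[:, -T - 1:-1, :]

    # Token-level KL divergence yields a dense per-token signal,
    # unlike a single scalar reward at the end of the episode.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```

In this reading, the dense signal comes from matching the student's distribution to the teacher's at every attempt token, rather than from a single reward at the sequence end; the teacher is the policy itself, so no external model is required.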

