Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
By: Siyan Zhao, Zhihui Xie, Mengchen Liu, Xiangchen Song, Haoyang Li, Yuhui Li, Yizhou Wang
Published: 2026-01-26
Abstract
This paper introduces On-Policy Self-Distillation (OPSD), a framework in which a single Large Language Model (LLM) acts as both teacher and student to enhance its own mathematical reasoning. OPSD gives the teacher policy access to ground-truth solutions as privileged information, which it uses to provide dense token-level supervision on the student policy's own rollouts. This design addresses the distribution mismatch inherent in off-policy distillation and the sparse-reward limitation of reinforcement learning, yielding 4-8x higher token efficiency and superior performance on challenging mathematical reasoning benchmarks.
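To make the core idea concrete, below is a minimal, hypothetical sketch of one OPSD training step. It assumes a HuggingFace-style causal LM (`model(...).logits`), and that the teacher view is simply the problem prompt concatenated with the ground-truth solution; the exact prompt format and the choice of divergence are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def opsd_token_loss(model, student_prompt_ids, teacher_prompt_ids, rollout_ids):
    """Hypothetical sketch of an on-policy self-distillation step.

    The same model scores the student's own rollout twice:
      * as the student, conditioned on the bare problem prompt;
      * as the teacher, conditioned on the problem plus the ground-truth
        solution (privileged information).
    The loss is a per-token divergence between the teacher's and student's
    next-token distributions, giving dense supervision on on-policy tokens.
    """
    # Student view: problem prompt followed by the sampled rollout.
    student_input = torch.cat([student_prompt_ids, rollout_ids], dim=-1)
    # Teacher view: problem + ground-truth solution, then the same rollout.
    teacher_input = torch.cat([teacher_prompt_ids, rollout_ids], dim=-1)

    student_logits = model(student_input).logits
    with torch.no_grad():  # gradients flow only through the student view
        teacher_logits = model(teacher_input).logits

    # Keep only the positions that predict rollout tokens
    # (logits at position i predict token i + 1).
    T = rollout_ids.shape[-1]
    s = student_logits[:, -T - 1:-1, :]
    t = teacher_logits[:, -T - 1:-1, :]

    # Per-token KL(teacher || student); the paper's exact divergence
    # choice is an assumption here.
    kl = F.kl_div(
        F.log_softmax(s, dim=-1),
        F.softmax(t, dim=-1),
        reduction="none",
    ).sum(-1)
    return kl.mean()
```

In use, `rollout_ids` would be sampled from the student conditioned on `student_prompt_ids`, so the supervision is on-policy: every generated token receives a dense teacher signal rather than a single sparse reward at the end of the solution.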