Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
By: Siyan Zhao, Zhihui Xie, Mengchen Liu, Xiangchen Song, Haoyang Li, Yuhui Li, Yizhou Wang
Published: 2026-01-26
Abstract
This paper introduces On-Policy Self-Distillation (OPSD), a framework in which a single Large Language Model (LLM) acts as both teacher and student to enhance its own mathematical reasoning. OPSD gives the teacher policy access to ground-truth solutions as privileged information, which it uses to provide dense token-level supervision on the student policy's own rollouts. This design addresses the distribution mismatch inherent in off-policy distillation and the sparse-reward limitation of reinforcement learning, yielding 4-8x higher token efficiency and superior performance on challenging mathematical reasoning benchmarks.
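To make the core idea concrete, below is a minimal, hypothetical sketch of one OPSD training step. It assumes a HuggingFace-style causal LM (`model(...).logits`), and that the teacher view is simply the problem prompt concatenated with the ground-truth solution; the exact prompt format and the choice of divergence are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def opsd_token_loss(model, student_prompt_ids, teacher_prompt_ids, rollout_ids):
    """Hypothetical sketch of an on-policy self-distillation step.

    The same model scores the student's own rollout twice:
      * as the student, conditioned on the bare problem prompt;
      * as the teacher, conditioned on the problem plus the ground-truth
        solution (privileged information).
    The loss is a per-token divergence between the teacher's and student's
    next-token distributions, giving dense supervision on on-policy tokens.
    """
    # Student view: problem prompt followed by the sampled rollout.
    student_input = torch.cat([student_prompt_ids, rollout_ids], dim=-1)
    # Teacher view: problem + ground-truth solution, then the same rollout.
    teacher_input = torch.cat([teacher_prompt_ids, rollout_ids], dim=-1)

    student_logits = model(student_input).logits
    with torch.no_grad():  # gradients flow only through the student view
        teacher_logits = model(teacher_input).logits

    # Keep only the positions that predict rollout tokens
    # (logits at position i predict token i + 1).
    T = rollout_ids.shape[-1]
    s = student_logits[:, -T - 1:-1, :]
    t = teacher_logits[:, -T - 1:-1, :]

    # Per-token KL(teacher || student); the paper's exact divergence
    # choice is an assumption here.
    kl = F.kl_div(
        F.log_softmax(s, dim=-1),
        F.softmax(t, dim=-1),
        reduction="none",
    ).sum(-1)
    return kl.mean()
```

In use, `rollout_ids` would be sampled from the student conditioned on `student_prompt_ids`, so the supervision is on-policy: every generated token receives a dense teacher signal rather than a single sparse reward at the end of the solution.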