AI Post Transformers

On-Policy Self-Distillation for Advanced LLM Reasoning



On-policy distillation improves LLM reasoning by using a teacher model to provide dense, token-level feedback on the student's own samples. Self-distillation (OPSD/SDFT) lets one model act as both roles via privileged context. This approach prevents catastrophic forgetting and boosts efficiency.

Sources:
- "Learning by Distilling Context" (2022), Charlie Snell, Dan Klein, Ruiqi Zhong; University of California, Berkeley. https://arxiv.org/pdf/2209.15189
- "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes" (2024), Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem; Google DeepMind, Mila, University of Toronto. https://arxiv.org/pdf/2306.13649
- "On-Policy Distillation" (Oct 27, 2025), Kevin Lu; Thinking Machines Lab. https://thinkingmachines.ai/blog/on-policy-distillation
- "Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models" (2026), Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover; UCLA, HKU, Meta Superintelligence Labs. https://arxiv.org/pdf/2601.18734
- "Self-Distillation Enables Continual Learning" (2026), Idan Shenfeld, Mehul Damani, Jonas Hubotter, Pulkit Agrawal; MIT, Improbable AI Lab, ETH Zurich. https://arxiv.org/pdf/2601.19897
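The "dense, token-level feedback" described above is typically a per-token reverse KL divergence: the student samples a sequence, and at each position the teacher's distribution grades the student's distribution. A minimal toy sketch of that objective follows, assuming a reverse-KL formulation as used in the on-policy distillation literature cited above; the function names and the tiny logit vectors are illustrative, not from any of the papers.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """Per-token reverse KL, KL(student || teacher).

    On-policy distillation minimizes this on tokens the *student*
    sampled, so the student is penalized where it puts mass the
    teacher considers unlikely (mode-seeking behavior).
    """
    p = softmax(student_logits)   # student: the sampler being trained
    q = softmax(teacher_logits)   # teacher: supplies the dense grade
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sequence_loss(student_seq_logits, teacher_seq_logits):
    """Dense feedback: one KL term per generated token, averaged."""
    kls = [reverse_kl(s, t)
           for s, t in zip(student_seq_logits, teacher_seq_logits)]
    return sum(kls) / len(kls)

# Toy example over a 3-token vocabulary and a 2-token rollout.
student = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]]
teacher = [[1.9, 0.6, 0.1], [0.2, 1.9, 0.3]]
loss = sequence_loss(student, teacher)  # small positive number
```

In the self-distillation (OPSD/SDFT) setting, `teacher_logits` would come from the same model run with privileged context (e.g. a reference solution in the prompt), so one set of weights plays both roles.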

AI Post Transformers, by mcgrof