


The text from "On-Policy Distillation" introduces a method for training smaller "student" language models (LLMs) to reach expert performance by combining the strengths of existing post-training techniques. The authors review the standard LLM training stages—pre-training, mid-training, and post-training—and contrast two post-training approaches: on-policy training (such as reinforcement learning, or RL, which provides only sparse feedback) and off-policy training (such as supervised fine-tuning/distillation, which can suffer from compounding errors). The core innovation, on-policy distillation, samples trajectories from the student model but uses a high-performing "teacher" model to grade every token, providing a dense reward signal that is significantly more compute-efficient than traditional RL. The technique proves effective both for training models in mathematical reasoning and for continual learning tasks, such as preserving instruction-following abilities while incorporating new domain knowledge.
By Steven
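The dense per-token grading described above can be illustrated with a small sketch. Assuming both models expose the log-probability they assign to each token the student sampled (the function and variable names here are illustrative, not from the source), the per-token reward is an estimate of the negative reverse KL on the sampled token: `log p_teacher - log p_student`.

```python
import math

def per_token_rewards(student_logprobs, teacher_logprobs):
    """Dense reward for each sampled token: log p_teacher - log p_student.

    Tokens where the teacher agrees with the student score near 0;
    tokens the teacher considers unlikely get a large negative reward,
    so every token in the trajectory carries a learning signal
    (unlike sparse end-of-episode RL rewards).
    """
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Toy 3-token trajectory sampled from the student: log-probs each model
# assigns to the tokens the student actually generated.
student_lp = [math.log(0.9), math.log(0.5), math.log(0.8)]
teacher_lp = [math.log(0.9), math.log(0.1), math.log(0.8)]

rewards = per_token_rewards(student_lp, teacher_lp)
# Tokens 0 and 2: teacher agrees, reward ~0.
# Token 1: student diverged from the teacher, so it alone is penalized.
```

Because the trajectory is sampled from the student itself, this avoids the compounding errors of purely off-policy distillation while keeping the teacher's token-level supervision.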