Steven AI Talk

Thinking Machines Lab: On-Policy Distillation — Efficient Post-Training for LLMs



The text from "On-Policy Distillation" introduces a method for training smaller "student" large language models (LLMs) to expert-level performance by combining the strengths of existing post-training techniques. The authors review the traditional LLM training stages (pre-training, mid-training, and post-training) and compare two post-training approaches: on-policy training (such as reinforcement learning, or RL, which provides only sparse feedback) and off-policy training (such as supervised fine-tuning on teacher outputs, i.e. distillation, which can suffer from compounding errors when the student drifts away from the teacher's distribution). The core innovation, on-policy distillation, samples trajectories from the student model but uses a high-performing "teacher" model to grade every token, providing a dense reward signal that is significantly more compute-efficient than traditional RL. The technique is shown to be effective for training models in mathematical reasoning and for continual learning tasks, such as preserving instruction-following abilities while incorporating new domain knowledge.
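The per-token grading described above can be sketched concretely. A common instantiation of the dense signal is the per-token reverse KL divergence between the student's and teacher's next-token distributions, evaluated along a student-sampled trajectory; the sketch below assumes that formulation, and the function names and toy shapes are illustrative, not from the original post.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def per_token_reverse_kl(student_logits, teacher_logits):
    """Dense per-token signal: reverse KL(student || teacher) at each
    position of a student-sampled trajectory.

    Both inputs have shape (T, vocab); the return value has shape (T,),
    one non-negative score per token rather than one sparse reward per
    episode.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # reverse KL = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    return (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)

# Toy example: a 3-token trajectory over a vocabulary of 4.
rng = np.random.default_rng(0)
student = rng.normal(size=(3, 4))
teacher = rng.normal(size=(3, 4))

kl = per_token_reverse_kl(student, teacher)   # one value per token
zero = per_token_reverse_kl(student, student)  # identical models -> 0
```

Minimizing this quantity pushes the student toward the teacher only on states the student actually visits, which is what distinguishes this setup from ordinary off-policy distillation on teacher-generated text.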


Steven AI Talk, by Steven