In this episode:
• Introduction: Do We Really Need Another Phase?: Professor Norris jokingly laments the ever-expanding terminology of LLM training, while Linda introduces a paper that frames 'midtraining' as a distinct, intermediate phase between pretraining and post-training.
• The Mechanism: Building a Distributional Bridge: Linda explains the core theory: midtraining isn't just 'cooling down' the learning rate, but gradually shifting the training data toward the target distribution so that post-training starts from a closer initialization, smoothing the optimization path (see the sketch after this list).
• Results: Where It Works (and Where It Doesn't): The hosts discuss the finding that midtraining shines in 'distant' domains like code and math but matters less for general instruction-following, and cover the surprising result that it also reduces catastrophic forgetting.
• The Plasticity Window: Timing and Mixtures: A deep dive into the interaction between when you start midtraining and how much specialized data you use, highlighting the dangers of late, aggressive data injection.
• Conclusion: Better Than Continued Pretraining?: Norris concedes the method's utility after seeing the comparison against standard continued pretraining, and the pair summarize the practical takeaways for training schedules.
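
For listeners who want the mechanism in concrete terms, here is a minimal sketch of the kind of mixture schedule the episode describes: a midtraining window that linearly ramps the share of target-domain data (e.g., code or math) between pretraining and post-training. All function names and parameter values are illustrative assumptions, not taken from the paper.

```python
# Hypothetical midtraining mixture schedule (illustrative, not from the paper).
# Between a chosen start fraction of training and the end, linearly ramp the
# share of target-domain data so the model's effective initialization for
# post-training drifts toward the target distribution, rather than injecting
# specialized data late and aggressively.

def midtrain_mixture_weight(step: int,
                            total_steps: int,
                            midtrain_start_frac: float = 0.8,
                            max_target_frac: float = 0.5) -> float:
    """Fraction of each batch drawn from the target domain at `step`."""
    start = int(midtrain_start_frac * total_steps)
    if step < start:
        return 0.0  # still in the pure pretraining mixture
    # Linear ramp from 0 to max_target_frac across the midtraining window.
    progress = (step - start) / max(1, total_steps - start)
    return max_target_frac * progress

if __name__ == "__main__":
    total = 100_000
    for s in (0, 79_999, 80_000, 90_000, 100_000):
        print(f"step {s:>7}: target fraction = {midtrain_mixture_weight(s, total):.3f}")
```

The ramp shape and the 80% start point are placeholders; the episode's "plasticity window" discussion suggests the real interaction between start time and mixture ratio is what determines whether the bridge helps or hurts.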