In this episode:
• Introduction to Catastrophic Overtraining: Linda and Professor Norris introduce the paper and the counterintuitive phenomenon where better pretraining leads to worse catastrophic forgetting.
• Feature Drift and Optimization Regimes: The hosts discuss how the supervised finetuning (SFT) learning rate acts as an implicit regularizer, and they introduce the Mean Principal Angle as a measure of feature drift.
• Sharpness and the Edge of Stability: Linda connects the mystery of overtraining to pretraining learning rate decay, explaining how model sharpness amplifies the effect of the finetuning learning rate.
• Practical Takeaways for LLM Training: Professor Norris and Linda summarize the actionable advice from the paper, including lowering SFT learning rates and rethinking pretraining schedules.