Mechanical Dreams

Learning Rates Regulate Catastrophic Overtraining



In this episode:
• Introduction to Catastrophic Overtraining: Linda and Professor Norris introduce the paper and the counterintuitive phenomenon where better pretraining leads to worse catastrophic forgetting.
• Feature Drift and Optimization Regimes: The hosts discuss how the supervised finetuning learning rate acts as an implicit regularizer, introducing the Mean Principal Angle to measure feature drift.
• Sharpness and the Edge of Stability: Linda connects the mystery of overtraining to pretraining learning rate decay, explaining how model sharpness amplifies the finetuning learning rate.
• Practical Takeaways for LLM Training: Professor Norris and Linda summarize the actionable advice from the paper, including lowering SFT learning rates and rethinking pretraining schedules.
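
For listeners who want a concrete handle on the feature-drift measurement mentioned above, here is a minimal sketch of one common way to compute a mean principal angle between two feature subspaces. These notes do not give the paper's exact definition, so the function name `mean_principal_angle`, the top-`k` truncation, and the SVD-based choice of subspace are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def mean_principal_angle(feats_a, feats_b, k=4):
    """Mean principal angle (radians) between the top-k feature subspaces.

    feats_a, feats_b: arrays of shape (n_samples, n_features), e.g. hidden
    activations on the same inputs before and after finetuning (assumed setup).
    """
    # Center each feature matrix, then take its top-k right singular vectors
    # as an orthonormal basis for the dominant feature subspace.
    _, _, vt_a = np.linalg.svd(feats_a - feats_a.mean(axis=0), full_matrices=False)
    _, _, vt_b = np.linalg.svd(feats_b - feats_b.mean(axis=0), full_matrices=False)
    basis_a, basis_b = vt_a[:k].T, vt_b[:k].T

    # The singular values of basis_a^T basis_b are the cosines of the
    # principal angles between the two subspaces.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(np.arccos(np.clip(cosines, 0.0, 1.0))))

# Hypothetical usage: random stand-ins for pretrained and finetuned features.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(100, 8))
finetuned = pretrained + 0.1 * rng.normal(size=(100, 8))
drift = mean_principal_angle(pretrained, finetuned, k=4)
```

An angle near 0 means the two subspaces nearly coincide (little drift), while an angle near π/2 means they are close to orthogonal.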

Mechanical Dreams, by Mechanical Dirk