In this episode:
• Introduction: The Mystery of Warmup: Linda introduces a new NeurIPS 2024 paper that questions the true purpose of learning rate warmup. Professor Norris shares the conventional, yet incomplete, wisdom behind the practice.
• Tolerating Larger Learning Rates and the Sharpness Factor: The hosts discuss the paper's central claim that warmup's main benefit is letting a network tolerate a larger target learning rate by moving it into flatter regions of the loss landscape.
• Catapults and the Edge of Stability: Linda dives into the technical details of loss catapults and how progressive sharpening and natural sharpness reduction guide the network during early training stages.
• Adam, Pre-conditioned Sharpness, and Training Failures: Professor Norris and Linda explore how these mechanisms apply to adaptive optimizers like Adam, and why Adam experiences catastrophic training failures instead of standard divergence.
• GI-Adam and Better Initialization Strategies: The hosts review the paper's practical improvements, including Gradient Initialized Adam and strategies for estimating the initial learning rate to save compute time.
• Conclusion and Final Thoughts: Professor Norris concedes the brilliance of the paper's arguments, and the hosts wrap up the episode with key takeaways for deep learning practitioners.