April 09, 2026

Why Warmup the Learning Rate

22 minutes

In this episode:
• Introduction: The Mystery of Warmup: Linda introduces a new NeurIPS 2024 paper that questions the true purpose of learning rate warmup. Professor Norris shares the conventional, yet incomplete, wisdom behind the practice.
• Tolerating Larger Learning Rates and the Sharpness Factor: The hosts discuss the paper's central claim that warmup's main benefit is allowing models to tolerate larger target learning rates by moving them to flatter regions of the loss landscape.
• Catapults and the Edge of Stability: Linda dives into the technical details of loss catapults and how progressive sharpening and natural sharpness reduction guide the network during early training stages.
• Adam, Pre-conditioned Sharpness, and Training Failures: Professor Norris and Linda explore how these mechanisms apply to adaptive optimizers like Adam, and why Adam experiences catastrophic training failures instead of standard divergence.
• GI-Adam and Better Initialization Strategies: The hosts review the paper's practical improvements, including Gradient Initialized Adam and strategies for estimating the initial learning rate to save compute time.
• Conclusion and Final Thoughts: Professor Norris concedes the brilliance of the paper's arguments, and the hosts wrap up the episode with key takeaways for deep learning practitioners.

By Mechanical Dirk
