In this episode:
• Beyond Adam: The Great Optimizer Bake-Off: Linda introduces a paper questioning the decade-long reign of the AdamW optimizer for training large language models. Professor Norris expresses his healthy skepticism about the endless stream of 'new and improved' optimizers.
• Adam's Kingdom and Its Challengers: The hosts discuss why AdamW became the default and the paper's motivation: the lack of systematic, fair comparisons among the many new optimizers that claim to beat it. Professor Norris recalls past optimizer fads. (A sketch of the AdamW update appears after this list.)
• Creating a Level Playing Field: Linda details the paper's rigorous experimental setup, covering the 11 optimizers tested and the massive hyperparameter tuning effort required for a fair fight. Professor Norris is impressed by the scale of the benchmark.
• And the Winner Is... It's Complicated: Linda reveals the main results, highlighting AdEMAMix and MARS as the new frontrunners, especially at scale. The hosts work through the paper's many plots, discussing where different optimizers shine. (A sketch of AdEMAMix's core update appears after this list.)
• Actionable Advice for the Practitioner: Professor Norris and Linda distill the paper's 'takeaways' into practical advice for listeners. They discuss the critical and often overlooked roles of weight decay, learning rate schedules, and warmup. (A schedule sketch appears after this list.)
• The Optimization Frontier: The hosts conclude that while AdamW's dominance is over, the best optimizer is context-dependent. They wrap up by discussing the paper's impact and the future of optimization research.
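For listeners who want the nuts and bolts behind the second segment, here is a minimal Python sketch of the AdamW update, showing what "decoupled" weight decay means: the decay is applied directly to the weights rather than folded into the gradient. Function and parameter names are ours for illustration, not taken from the paper.

```python
import torch

# Minimal sketch of one AdamW step (illustrative, not the paper's code).
# m and v are persistent per-parameter state tensors; t is the step count (>= 1).
def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first-moment EMA
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)  # Adam-style step
    # Decoupled weight decay: shrink the weights directly,
    # independent of the gradient-based step above.
    param.mul_(1 - lr * weight_decay)
```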
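As a companion to the results segment, a hedged sketch of what makes AdEMAMix different from AdamW: it keeps a second, much slower EMA of gradients (beta3 close to 1) and mixes it into the Adam-style numerator. We omit the AdEMAMix paper's warmup schedulers for alpha and beta3, and the hyperparameter values below are illustrative defaults, not the benchmark's tuned settings.

```python
import torch

# Sketch of the core AdEMAMix update (schedulers omitted; values illustrative).
# m1, m2, v are persistent state tensors; t is the step count (>= 1).
def ademamix_step(param, grad, m1, m2, v, t, lr=1e-3, beta1=0.9,
                  beta2=0.999, beta3=0.9999, alpha=5.0, eps=1e-8,
                  weight_decay=0.1):
    m1.mul_(beta1).add_(grad, alpha=1 - beta1)           # fast EMA, as in Adam
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)           # slow EMA, long memory
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment EMA
    m1_hat = m1 / (1 - beta1 ** t)                       # bias-correct fast EMA
    v_hat = v / (1 - beta2 ** t)
    # Mix the slow EMA into the numerator: old gradients keep contributing.
    update = (m1_hat + alpha * m2) / (v_hat.sqrt() + eps)
    param.add_(update, alpha=-lr)
    param.mul_(1 - lr * weight_decay)                    # decoupled weight decay
```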
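And for the practitioner-advice segment, a small sketch of the common linear-warmup-plus-cosine-decay learning rate schedule the hosts discuss; warmup_steps, total_steps, and min_lr_ratio are illustrative names, not values from the paper.

```python
import math

# Learning rate at a given step: linear warmup to base_lr,
# then cosine decay down to min_lr_ratio * base_lr.
def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr_ratio=0.1):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps           # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))        # decays 1 -> 0
    return base_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)
```

Whatever optimizer you pick, the same point from the episode applies: the schedule and decay settings around it often matter as much as the update rule itself.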