In this episode:
• Introduction: The Optimizer Menagerie: Professor Norris and Linda kick off the episode by discussing the explosion of new optimizers in the LLM space. Linda introduces 'NorMuon,' a paper from Georgia Tech and Microsoft that attempts to bridge the gap between the industry standard, AdamW, and the geometric newcomer, Muon.
• The Geometry Problem: Why Adam and Muon Fall Short: Linda explains the fundamental trade-off: Adam handles coordinate-wise scaling well but ignores matrix geometry, while Muon fixes the geometry via orthogonalization but suffers from imbalanced update norms across neurons. Norris challenges the necessity of fixing Muon, prompting a discussion on 'condition numbers' versus 'neuron norms.'
• The NorMuon Solution: Best of Both Worlds: The hosts dive into the algorithm itself, detailing how NorMuon applies neuron-wise adaptive learning rates (similar to Adam-mini) *after* Muon's orthogonalization step. They discuss the intuition behind using second-moment statistics to normalize the disparate scales of the neuron updates (see the first sketch after this list).
• Engineering at Scale: FSDP2 and Distributed Newton-Schulz: The discussion shifts to the systems engineering required to make this work on large clusters. Linda explains how the authors implemented NorMuon under the FSDP2 framework, specifically how they distribute the expensive Newton-Schulz orthogonalization across devices to avoid redundant computation (a toy version appears in the second sketch after this list).
• Results and Verdict: Efficiency Gains: Norris reviews the empirical results, noting the 21% efficiency gain over AdamW on 1.1B-parameter models and the impressive memory savings. The episode concludes with a consensus that orthogonalization and adaptive scaling are complementary, not competing, techniques.
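
For listeners who want to see the mechanics, the sketch below walks through the update the episode describes: Muon's Newton-Schulz orthogonalization followed by NorMuon's neuron-wise (row-wise) normalization. It is a minimal PyTorch illustration assembled from the episode's description, not the authors' code; the Newton-Schulz coefficients follow the widely shared public Muon implementation, while the hyperparameters, buffer handling, and final rescaling step are assumptions for illustration.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix G (Muon's core step).

    Coefficients are the quintic-iteration constants from the public Muon
    implementation; the iteration runs in bfloat16, as is common in practice.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)              # bring the spectral norm near 1
    transposed = G.size(0) > G.size(1)
    if transposed:                        # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

def normuon_step(grad, M, V, lr, beta1=0.95, beta2=0.95, eps=1e-8):
    """One NorMuon-style update for a 2-D weight (illustrative, not the paper's code).

    M: momentum buffer, same shape as grad (updated in place).
    V: per-neuron (per-row) second-moment buffer, shape (grad.size(0),), updated in place.
    """
    # 1) Muon-style heavy-ball momentum on the raw gradient.
    M.mul_(beta1).add_(grad)
    # 2) Orthogonalize the momentum. This fixes the matrix geometry but can
    #    leave individual neuron (row) norms imbalanced.
    O = newton_schulz(M)
    # 3) Track a running second moment of each neuron's orthogonalized update.
    V.mul_(beta2).add_(O.pow(2).mean(dim=1), alpha=1.0 - beta2)
    # 4) Normalize each row by its RMS, then rescale so the overall update
    #    magnitude stays comparable to plain Muon (this rescaling is an assumption).
    update = O / (V.sqrt().unsqueeze(1) + eps)
    update = update * (O.norm() / (update.norm() + eps))
    return -lr * update  # delta to add to the weight matrix

# The imbalance from the "geometry problem" segment: even a random matrix,
# once orthogonalized, ends up with rows of unequal norm.
torch.manual_seed(0)
O = newton_schulz(torch.randn(512, 256))
row_norms = O.norm(dim=1)
print(row_norms.min().item(), row_norms.max().item())
```

The last few lines print the per-neuron (row) norms of an orthogonalized random matrix; their spread is the imbalance the row-wise normalization in step 4 is meant to remove.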
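
The sharding discussion is easier to picture with a toy scheme. Under FSDP2 each rank holds only a shard of every parameter, while Newton-Schulz needs the full 2-D matrix, so one natural layout, roughly in the spirit of what the episode describes, is to assign each matrix to an owner rank, run the iteration only there, and broadcast the result. The round-robin ownership, function names, and use of a plain broadcast below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.distributed as dist

def distributed_orthogonalize(momenta, orthogonalize):
    """Share the cost of Newton-Schulz across ranks (illustrative sketch).

    Assumes torch.distributed is already initialized (e.g. via torchrun) and
    that `momenta` is a list of full 2-D momentum tensors in the same order on
    every rank (FSDP2 DTensors can be materialized with .full_tensor() first).
    `orthogonalize` is the Newton-Schulz routine from the previous sketch.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    results = []
    for i, m in enumerate(momenta):
        owner = i % world               # round-robin ownership of matrices (assumption)
        if rank == owner:
            out = orthogonalize(m)      # only the owner pays for the iteration
        else:
            out = torch.empty_like(m)   # placeholder buffer to receive the result
        dist.broadcast(out, src=owner)  # every rank reuses the single computation
        results.append(out)
    return results
```

The point is the division of labor: with W matrices and N ranks, each rank runs roughly W/N Newton-Schulz iterations instead of all W, which is the redundant computation the hosts call out.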