In this episode:
• Introduction: The Optimizer Zoo: Professor Norris and Linda introduce the topic of optimization in LLMs, joking about the explosion of new optimizers before introducing the paper of the week: TEON.
• The Muon Foundation: Linda recaps the Muon optimizer, explaining how it orthogonalizes each layer's gradient matrix to prevent rank collapse in the updates, while Norris probes a limitation: Muon treats every layer independently.
• Enter the Tensor: How TEON Works: Linda explains the core innovation of TEON: stacking gradients from multiple layers into a tensor and using matricization to orthogonalize them jointly.
• The Theory: Singular Vector Alignment: The hosts discuss the theoretical justification, focusing on Proposition 4.6 and why gradients in Transformers (specifically Q, K, and V) exhibit strong singular vector alignment.
• Results and The Polar Express: A look at the experimental results on GPT and LLaMA models, showing that TEON outperforms Muon even when the exact SVD is replaced by approximate orthogonalization schemes like Polar Express.
• Conclusion: Professor Norris concedes that TEON offers a principled improvement over Muon, and the duo signs off.
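For listeners who want to see the Muon step from the second segment concretely: the core operation maps a gradient matrix to its nearest semi-orthogonal matrix (the polar factor), which equalizes the singular values and counteracts rank collapse. Muon approximates this with a Newton-Schulz iteration; the sketch below uses an exact SVD instead, purely for clarity.

```python
import numpy as np

def orthogonalize(G):
    """Replace a gradient matrix by its polar factor.

    With the SVD G = U @ diag(s) @ Vt, the polar factor is U @ Vt:
    same singular vectors as G, but all singular values set to 1.
    Muon computes this approximately via Newton-Schulz; here we use
    an exact SVD for illustration.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))   # toy per-layer gradient
O = orthogonalize(G)
# O has orthonormal columns: O.T @ O is the identity
```

Per-layer application of this map is exactly the independence Norris questions: each `G` is processed in isolation.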
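And a hypothetical sketch of the TEON idea as Linda describes it: stack same-shaped gradients from several layers (e.g. Q, K, V projections) into a tensor, matricize (unfold) it, orthogonalize the single unfolded matrix, and split the result back into per-layer updates. The function name, the choice of unfolding, and the use of an exact SVD are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def joint_orthogonalize(grads):
    """Jointly orthogonalize a list of same-shaped gradient matrices.

    Sketch of the episode's description of TEON (assumptions noted in
    the lead-in): stack into an (L, m, n) tensor, unfold to (L*m, n),
    take the polar factor once, then fold back into L updates.
    """
    T = np.stack(grads, axis=0)        # (L, m, n) gradient tensor
    M = T.reshape(-1, T.shape[-1])     # matricization: (L*m, n)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    O = U @ Vt                         # joint polar factor
    return list(O.reshape(T.shape))    # per-layer update matrices

rng = np.random.default_rng(1)
gq, gk, gv = (rng.standard_normal((4, 3)) for _ in range(3))
uq, uk, uv = joint_orthogonalize([gq, gk, gv])
```

The point of the joint step is that shared singular directions across Q, K, and V (the alignment discussed around Proposition 4.6) are normalized once, rather than three times independently.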