Mechanical Dreams

Drop-Muon: Update Less, Converge Faster



In this episode:
• Introduction: Less is More in Optimization?: Professor Norris and Linda introduce the "Drop-Muon" paper, which challenges the fundamental assumption that every neural network layer must be updated at every training step. They set the stage by asking whether selectively updating layers could lead to faster convergence.
• A Refresher on the Muon Family: Linda gives a high-level overview of modern non-Euclidean optimizers such as Muon, Scion, and Gluon, explaining how these methods exploit layer-specific geometry to improve training. This background lays the foundation for the Drop-Muon approach.
• The Drop-Muon Algorithm: Randomized Progressive Training: Linda explains the core mechanism of Drop-Muon: at each iteration it samples a random subset of layers and updates only those. Professor Norris probes the practicalities of this approach, especially the concept of "Randomized Progressive Training" and its computational cost (a minimal sketch of the subset-update step appears after this list).
• The Theoretical Justification: When Is a Full-Network Update Optimal?: The hosts delve into the paper's theoretical contributions, highlighting the key finding that full-network updates are optimal only under a very restrictive, and in practice unlikely, condition on the layers' smoothness constants. They also discuss the implications of the cost model, which accounts for both backpropagation and parameter-update costs (a schematic version of such a cost model appears after this list).
• Empirical Results and Final Thoughts: Linda presents the experimental results, in which Drop-Muon reaches the same accuracy as standard Muon up to 1.4x faster in wall-clock time. They conclude by discussing the practical impact of this "update less, converge faster" strategy for training large models.
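For intuition, here is a minimal Python sketch of the "update a random subset of layers" step discussed in the algorithm segment. It is a sketch under stated assumptions, not the paper's implementation: the independent per-tensor sampling, the plain gradient step (standing in for Muon's layer-wise update), and names such as subset_update_step and keep_prob are all illustrative, and each parameter tensor is treated as a "layer" for simplicity.

```python
# Illustrative sketch only: independent per-tensor sampling and a plain
# gradient step stand in for Drop-Muon's sampling scheme and Muon update.
import torch

def subset_update_step(model, loss_fn, inputs, targets, lr=0.1, keep_prob=0.5):
    """One training step that updates only a randomly sampled subset of layers."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Sample which parameter tensors get updated this iteration; force at least one.
    mask = torch.rand(len(params)) < keep_prob
    if not mask.any():
        mask[torch.randint(len(params), (1,))] = True

    loss = loss_fn(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for keep, p in zip(mask, params):
            if keep and p.grad is not None:
                p -= lr * p.grad  # stand-in for Muon's layer-wise update
            p.grad = None  # clear gradients for skipped and updated layers alike
    return loss.item()

# Illustrative usage on a toy model:
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
x, y = torch.randn(32, 8), torch.randn(32, 1)
print(subset_update_step(model, torch.nn.functional.mse_loss, x, y))
```

Note that this sketch still backpropagates through the whole network; the wall-clock savings the hosts describe depend on skipped layers also letting backpropagation stop early, as in the cost-model sketch below.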
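The optimality discussion turns on a cost model for partial updates. The following is a schematic illustration only: the symbols $b_i$ and $u_i$ and the exact form of the cost are assumptions for exposition, not the paper's notation. If layer $i$ has backpropagation cost $b_i$ and update cost $u_i$, then a step that updates a subset $S$ of layers whose deepest (closest-to-input) member is $m(S) = \min S$ costs roughly

$$ c(S) \;=\; \sum_{i \ge m(S)} b_i \;+\; \sum_{i \in S} u_i, $$

since backpropagation only needs to run from the output down to the deepest updated layer, while update cost is paid only for layers in $S$. Under a cost model of this shape, per the episode's summary, always updating every layer is the best use of compute only when the layers' smoothness constants satisfy a restrictive balance condition.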

Mechanical Dreams, by Mechanical Dirk