Mechanical Dreams

Drop-Muon: Update Less, Converge Faster



In this episode:
• Introduction: Less is More in Optimization?: Professor Norris and Linda introduce the "Drop-Muon" paper, which challenges the fundamental assumption that every neural network layer must be updated at every training step. They set the stage by asking whether selectively updating layers could lead to faster convergence.
• A Refresher on the Muon Family: Linda gives a high-level overview of modern non-Euclidean optimizers such as Muon, Scion, and Gluon, explaining how these methods exploit layer-specific geometry to improve training. This background lays the foundation for the Drop-Muon approach.
• The Drop-Muon Algorithm: Randomized Progressive Training: Linda explains the core mechanism of Drop-Muon: at each iteration it samples a random subset of layers and updates only those. Professor Norris probes the practicalities of this approach, especially the concept of "Randomized Progressive Training" and its computational cost (a minimal sketch of the subset-update step appears after this list).
• The Theoretical Justification: When Is a Full-Network Update Optimal?: The hosts delve into the paper's theoretical contributions, highlighting the key finding that full-network updates are optimal only under a very restrictive, and in practice unlikely, condition on the layers' smoothness constants. They also discuss the implications of the cost model, which accounts for both backpropagation and parameter-update costs (a schematic version of such a cost model appears after this list).
• Empirical Results and Final Thoughts: Linda presents the experimental results, in which Drop-Muon reaches the same accuracy as standard Muon up to 1.4x faster in wall-clock time. They conclude by discussing the practical impact of this "update less, converge faster" strategy for training large models.
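For intuition, here is a minimal Python sketch of the "update a random subset of layers" step discussed in the algorithm segment. It is a sketch under stated assumptions, not the paper's implementation: the independent per-tensor sampling, the plain gradient step (standing in for Muon's layer-wise update), and names such as subset_update_step and keep_prob are all illustrative, and each parameter tensor is treated as a "layer" for simplicity.

```python
# Illustrative sketch only: independent per-tensor sampling and a plain
# gradient step stand in for Drop-Muon's sampling scheme and Muon update.
import torch

def subset_update_step(model, loss_fn, inputs, targets, lr=0.1, keep_prob=0.5):
    """One training step that updates only a randomly sampled subset of layers."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Sample which parameter tensors get updated this iteration; force at least one.
    mask = torch.rand(len(params)) < keep_prob
    if not mask.any():
        mask[torch.randint(len(params), (1,))] = True

    loss = loss_fn(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for keep, p in zip(mask, params):
            if keep and p.grad is not None:
                p -= lr * p.grad  # stand-in for Muon's layer-wise update
            p.grad = None  # clear gradients for skipped and updated layers alike
    return loss.item()

# Illustrative usage on a toy model:
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
x, y = torch.randn(32, 8), torch.randn(32, 1)
print(subset_update_step(model, torch.nn.functional.mse_loss, x, y))
```

Note that this sketch still backpropagates through the whole network; the wall-clock savings the hosts describe depend on skipped layers also letting backpropagation stop early, as in the cost-model sketch below.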
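The optimality discussion turns on a cost model for partial updates. The following is a schematic illustration only: the symbols $b_i$ and $u_i$ and the exact form of the cost are assumptions for exposition, not the paper's notation. If layer $i$ has backpropagation cost $b_i$ and update cost $u_i$, then a step that updates a subset $S$ of layers whose deepest (closest-to-input) member is $m(S) = \min S$ costs roughly

$$ c(S) \;=\; \sum_{i \ge m(S)} b_i \;+\; \sum_{i \in S} u_i, $$

since backpropagation only needs to run from the output down to the deepest updated layer, while update cost is paid only for layers in $S$. Under a cost model of this shape, per the episode's summary, always updating every layer is the best use of compute only when the layers' smoothness constants satisfy a restrictive balance condition.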

Mechanical Dreams, by Mechanical Dirk