Mechanical Dreams

Predictable Scale

In this episode:
• Introduction: The Alchemy of Training: Professor Norris laments the 'black magic' of hyperparameter tuning, and Linda introduces the paper 'Predictable Scale: Part I, Step Law', which promises to turn that alchemy into science.
• The Million-Hour Experiment: The hosts discuss the unprecedented scale of the study, which trained 3,700 models over nearly one million H800 GPU hours to map the loss landscape.
• Defining the Step Law: Linda explains the core mathematical findings: how the optimal Learning Rate scales with model size (N) and data size (D), and the surprising revelation that the optimal Batch Size depends almost entirely on D, not N (see the sketch after this list).
• Universality: MoEs and Data Recipes: A deep dive into how the Step Law holds up across sparse Mixture-of-Experts models and varying data distributions (such as code-heavy or multilingual data), outperforming earlier hyperparameter scaling laws such as DeepSeek's and OpenAI's.
• Conclusion: A Plug-and-Play Future: Norris concedes that the empirical evidence is overwhelming. They wrap up with the implications for efficient LLM training and what this means for the industry.
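
For listeners who want to play with the idea, here is a minimal Python sketch of the power-law form discussed in 'Defining the Step Law': the optimal learning rate is modeled as a function of both N and D, while the optimal batch size is a function of D alone. The function names and all numeric coefficients below are illustrative placeholders, not the paper's fitted values.

```python
# Hypothetical coefficients for illustration only; the paper fits its own
# constants from its 3,700-model sweep. The power-law *form* follows the
# episode's description: LR depends on N and D, batch size on D alone.
A, ALPHA, BETA = 1.0, -0.7, 0.3   # placeholder learning-rate coefficients
C, GAMMA = 0.5, 0.55              # placeholder batch-size coefficients

def optimal_lr(n_params: float, n_tokens: float) -> float:
    """Optimal learning rate as a power law in model size N and data size D."""
    return A * n_params**ALPHA * n_tokens**BETA

def optimal_batch_size(n_tokens: float) -> float:
    """Optimal batch size (in tokens) as a power law in data size D only."""
    return C * n_tokens**GAMMA

if __name__ == "__main__":
    N, D = 1e9, 100e9  # a 1B-parameter model trained on 100B tokens
    print(f"optimal lr         ~ {optimal_lr(N, D):.2e}")
    print(f"optimal batch size ~ {optimal_batch_size(D):,.0f} tokens")
```

With these placeholder exponents, a 1B-parameter model trained on 100B tokens gets a learning rate on the order of 1e-3 and a batch size of a few hundred thousand tokens; the point is the plug-and-play shape of the law, not the specific numbers.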

Mechanical Dreams, by Mechanical Dirk