In this episode:
• Introduction: The Alchemy of Training: Professor Norris laments the 'black magic' of hyperparameter tuning, and Linda introduces the paper 'Predictable Scale: Part I, Step Law' which promises to turn that alchemy into science.
• The Million-Hour Experiment: The hosts discuss the unprecedented scale of the study, involving 3,700 models and nearly one million H800 GPU hours, to map the loss landscape.
• Defining the Step Law: Linda explains the core mathematical findings: how Learning Rate scales with model size (N) and data size (D), and the surprising revelation that optimal Batch Size depends almost entirely on D, not N.
• Universality: MoEs and Data Recipes: A deep dive into how the Step Law holds up for sparse Mixture-of-Experts (MoE) models and varying data distributions (such as code-heavy or multilingual mixtures), outperforming earlier scaling laws such as those proposed by DeepSeek and OpenAI.
• Conclusion: A Plug-and-Play Future: Norris concedes that the empirical evidence is overwhelming. They wrap up with the implications for efficient LLM training and what this means for the industry.
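
To make the "plug-and-play" idea concrete, here is a minimal sketch of how a Step Law-style predictor would be used. The power-law *form* follows the episode's description (learning rate depends on both model size N and data size D; batch size depends almost entirely on D), but every constant and exponent below is an illustrative placeholder, not the paper's fitted values:

```python
# Sketch of a Step Law-style hyperparameter predictor.
# NOTE: all constants and exponents here are illustrative placeholders --
# consult the paper for the actual fitted values.

def optimal_learning_rate(n_params: float, n_tokens: float,
                          c: float = 1.0,
                          alpha: float = -0.7,
                          beta: float = 0.3) -> float:
    """Optimal LR as a power law in model size N and data size D."""
    return c * (n_params ** alpha) * (n_tokens ** beta)

def optimal_batch_size(n_tokens: float,
                       c: float = 1.0,
                       gamma: float = 0.5) -> float:
    """Optimal batch size (in tokens) depends on data size D alone,
    per the episode's summary of the Step Law finding."""
    return c * (n_tokens ** gamma)

if __name__ == "__main__":
    N = 1e9   # a 1B-parameter model
    D = 1e11  # 100B training tokens
    print(f"predicted LR: {optimal_learning_rate(N, D):.3e}")
    print(f"predicted batch size (tokens): {optimal_batch_size(D):.3e}")
```

The practical appeal discussed in the episode is exactly this shape: once the exponents are fitted on small runs, large-run hyperparameters fall out of two closed-form evaluations rather than a costly sweep.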