In this episode:
• Dessert Before Vegetables?: Professor Norris and Linda introduce the concept of Curriculum Learning in LLMs and discuss why the intuitive idea of saving the best data for last has historically failed to produce significant results.
• The Invisible Antagonist: Learning Rate Decay: Linda reveals the paper's core insight: standard learning rate schedules decay to near zero just as the high-quality data arrives, effectively wasting the most valuable training tokens.
• Signal, Noise, and the River Valley: The hosts discuss the theoretical mechanism, using a 'river valley' analogy to explain how high-quality data provides a strong signal direction that is dampened by aggressive optimization schedules.
• The Solution: Curriculum Model Averaging (CMA): Linda details the paper's proposed method: replacing learning rate decay with a constant learning rate combined with weight averaging (EMA) to stabilize the model while keeping it plastic enough to learn from the good data (see the sketch after this list).
• Results at Scale: A deep dive into the experimental results on 1.5B parameter models, showing how this new regime outperforms random shuffling by over 1.6% on standard benchmarks.
• Rethinking the Pretraining Recipe: Professor Norris concedes the brilliance of the approach, and the two discuss the broader implications for mid-training and the necessity of co-designing data curricula with optimization hyperparameters.
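To make the regime discussed in the episode concrete, here is a minimal sketch of training with a constant learning rate plus an exponential moving average (EMA) of the weights. The model, data loader, and hyperparameters (e.g. lr=3e-4, ema_decay=0.999) are illustrative placeholders, not the paper's actual setup.

```python
import copy
import torch
import torch.nn as nn


def train_with_constant_lr_and_ema(model, data_loader, steps, lr=3e-4, ema_decay=0.999):
    """Constant-LR training with an EMA of the weights (illustrative sketch)."""
    # Constant learning rate: no scheduler, so late-arriving (high-quality)
    # tokens still receive full-sized updates instead of near-zero ones.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    # The EMA copy supplies the stability that LR decay normally provides,
    # while the live model stays plastic enough to learn from the good data.
    ema_model = copy.deepcopy(model)
    for p in ema_model.parameters():
        p.requires_grad_(False)

    for step, (inputs, targets) in zip(range(steps), data_loader):
        logits = model(inputs)
        loss = loss_fn(logits, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update the weight average after every optimizer step:
        # ema <- ema_decay * ema + (1 - ema_decay) * current_weights
        with torch.no_grad():
            for ema_p, p in zip(ema_model.parameters(), model.parameters()):
                ema_p.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)

    # Evaluate and deploy the averaged weights rather than the live ones.
    return ema_model
```

The key design choice is that stabilization moves from the optimizer (decaying step sizes) into the averaged weights, so the signal from high-quality late-curriculum data is not dampened.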