In this episode:
• Introduction: Hitting the Data Wall: Professor Norris and Linda introduce the episode's paper, 'OPUS', and discuss the looming 'Data Wall' where high-quality public text is exhausted, necessitating a shift from more tokens to better tokens.
• The Flaw in Current Data Selection: The hosts debate existing methods, contrasting static filters like FineWeb-Edu with dynamic selection. Linda explains why scoring data based on raw gradients fails when modern optimizers like AdamW or Muon reshape the update geometry.
• Defining Utility in the Optimizer's World: Linda breaks down the core mechanism of OPUS: measuring data utility in the optimizer-induced update space rather than the raw gradient space. Norris grapples with what it means to align data selection with the optimizer's actual trajectory.
• Scaling Up: Ghosts and Sketches: A deep dive into how OPUS makes per-sample gradient estimation computationally feasible. The discussion covers the use of the 'Ghost' technique combined with CountSketch to project updates into a low-dimensional space without ever materializing them in full.
• Diversity via Boltzmann and The Proxy: The hosts discuss how OPUS avoids 'diversity collapse' by using Boltzmann sampling instead of greedy selection, and how it constructs a stable 'Bench-Proxy' from the pre-training corpus to guide selection.
• Results and Final Thoughts: Reviewing the empirical results where OPUS outperforms industrial baselines on GPT-2 and Qwen3-8B. Norris concedes the cleverness of the approach, and they wrap up with thoughts on data efficiency.
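For listeners who want a concrete feel for the update-geometry point discussed in the episode, here is a toy numpy sketch (our illustration, not code from the paper; all gradients and numbers are made up): with an AdamW-style diagonal preconditioner, the sample that scores best by raw-gradient alignment need not be the one whose optimizer update moves the model furthest along the target direction.

```python
import numpy as np

# Toy 2-D "parameter space": two candidate samples' gradients and a
# target direction the proxy objective wants the model to move along.
# All values are hypothetical, chosen to make the ranking flip visible.
g_a = np.array([10.0, 0.1])   # large gradient, mostly on a heavily-updated axis
g_b = np.array([0.1, 1.0])    # small gradient on a rarely-updated axis
target = np.array([1.0, 1.0])

# Diagonal second-moment estimate, as an AdamW-style optimizer would track:
# axis 0 has seen huge gradients, axis 1 almost none.
v = np.array([100.0, 0.01])
eps = 1e-8

def update_dir(g):
    """Optimizer-induced update: gradient rescaled by 1/(sqrt(v)+eps)."""
    return g / (np.sqrt(v) + eps)

# Utility scored on raw gradients vs. in the optimizer's update space.
raw_a, raw_b = g_a @ target, g_b @ target            # raw scoring prefers A
opt_a, opt_b = update_dir(g_a) @ target, update_dir(g_b) @ target  # update-space prefers B

print(f"raw:    A={raw_a:.2f}  B={raw_b:.2f}")
print(f"update: A={opt_a:.2f}  B={opt_b:.2f}")
```

The preconditioner crushes sample A's large but "stale-axis" gradient, so the two scoring rules rank the samples in opposite orders, which is the episode's argument for measuring utility where the optimizer actually lives.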
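The sketching step can also be illustrated on its own. Below is a minimal CountSketch in numpy (our sketch under simplified assumptions; OPUS pairs this with the 'Ghost' trick so per-sample gradients are never materialized, which we do not reproduce here): each coordinate hashes into one of k buckets with a random sign, and inner products are preserved in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 256  # full update dimension vs. sketch dimension (illustrative sizes)

# CountSketch: a fixed random hash of coordinates to buckets, with random signs.
bucket = rng.integers(0, k, size=d)
sign = rng.choice([-1.0, 1.0], size=d)

def countsketch(x):
    """Project x from R^d to R^k: signed sums within each hash bucket."""
    s = np.zeros(k)
    np.add.at(s, bucket, sign * x)  # unbuffered accumulation per bucket
    return s

u = rng.standard_normal(d)
s_u = countsketch(u)

# The squared norm (an inner product with itself) survives the projection
# approximately; the relative error shrinks roughly like 1/sqrt(k).
exact = u @ u
approx = s_u @ s_u
print(f"exact={exact:.1f}  sketched={approx:.1f}")
```

The key property is that similarity scores between sketched updates stand in for scores between full updates, at a tiny fraction of the memory cost.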
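Finally, the diversity point: greedy selection keeps returning the top-scoring (often near-duplicate) samples, while Boltzmann sampling draws each sample with probability proportional to exp(score / τ). A toy illustration with hypothetical utility scores (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

scores = np.array([3.0, 2.9, 2.8, 0.5, 0.4])  # hypothetical utility scores
tau = 1.0  # temperature: lower tau -> closer to greedy argmax

# Greedy: the argmax every time -> the same few samples, diversity collapse.
greedy_pick = int(np.argmax(scores))

# Boltzmann: sample with probability proportional to exp(score / tau),
# computed with the usual max-subtraction for numerical stability.
logits = scores / tau
p = np.exp(logits - logits.max())
p /= p.sum()

picks = rng.choice(len(scores), size=1000, p=p)
print("greedy always picks:", greedy_pick)
print("Boltzmann pick frequencies:", np.bincount(picks, minlength=len(scores)))
```

High-utility samples still dominate, but close runners-up are selected often enough to keep the selected pool varied, which is the behavior the hosts credit for avoiding diversity collapse.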