Mechanical Dreams

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration



In this episode:
• Introduction: Hitting the Data Wall: Professor Norris and Linda introduce the episode's paper, 'OPUS', and discuss the looming 'Data Wall' where high-quality public text is exhausted, necessitating a shift from more tokens to better tokens.
• The Flaw in Current Data Selection: The hosts debate existing methods, contrasting static filters like FineWeb-Edu with dynamic selection. Linda explains why scoring data by raw gradients fails once modern optimizers like AdamW or Muon reshape the update geometry.
• Defining Utility in the Optimizer's World: Linda breaks down the core mechanism of OPUS: measuring data utility in the optimizer-induced update space rather than the raw gradient space. Norris grapples with the idea of aligning data selection with the actual trajectory of optimization (see the first sketch after this list).
• Scaling Up: Ghosts and Sketches: A deep dive into how OPUS makes per-sample gradient estimation computationally feasible. The discussion covers the 'Ghost' technique combined with CountSketch to project updates into a low-dimensional space without ever materializing full per-sample gradients (see the CountSketch sketch below).
• Diversity via Boltzmann and The Proxy: The hosts discuss how OPUS avoids 'diversity collapse' by using Boltzmann sampling instead of greedy selection, and how it constructs a stable 'Bench-Proxy' from the pre-training corpus to steer selection (see the sampling sketch below).
• Results and Final Thoughts: Reviewing the empirical results where OPUS outperforms industrial baselines on GPT-2 and Qwen3-8B. Norris concedes the cleverness of the approach, and they wrap up with thoughts on data efficiency.
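For listeners who want the core idea in code: below is a minimal sketch of scoring utility in the optimizer-induced update space, assuming AdamW-style second-moment preconditioning and ignoring momentum, bias correction, and weight decay for brevity. The function names, the cosine scoring, and the use of a single proxy gradient are our illustrative assumptions, not the paper's exact formulation.

```python
import torch

def adamw_update_direction(grad, exp_avg_sq, eps=1e-8):
    # AdamW rescales each coordinate by 1 / (sqrt(v) + eps), so the
    # geometry of the actual update differs from the raw gradient.
    # (Momentum and weight decay are omitted in this toy version.)
    return grad / (exp_avg_sq.sqrt() + eps)

def update_space_utility(sample_grad, proxy_grad, exp_avg_sq):
    # Score a sample by how well its *preconditioned* update aligns
    # with the proxy's preconditioned update. A raw-gradient inner
    # product would ignore the optimizer's reshaping entirely.
    u_sample = adamw_update_direction(sample_grad, exp_avg_sq).flatten()
    u_proxy = adamw_update_direction(proxy_grad, exp_avg_sq).flatten()
    return torch.dot(u_sample, u_proxy) / (u_sample.norm() * u_proxy.norm())
```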
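Next, a minimal sketch of the CountSketch projection that keeps per-sample scoring cheap: inner products are preserved in expectation, so utilities computed on k-dimensional sketches approximate utilities in the full parameter space. Bucket count, seeding, and the flattened-gradient input are assumptions; in the paper's setting, the 'Ghost' trick additionally exploits the outer-product structure of linear-layer per-sample gradients so the full gradient is never materialized, which this toy version does not show.

```python
import torch

class CountSketch:
    # Random-sign hashing of a dim-dimensional vector into k buckets:
    # S(x)[j] = sum over {i : h(i) = j} of s(i) * x(i).
    def __init__(self, dim, k, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.bucket = torch.randint(0, k, (dim,), generator=g)          # h: [dim] -> [k]
        self.sign = torch.randint(0, 2, (dim,), generator=g) * 2 - 1    # s: [dim] -> {-1, +1}
        self.k = k

    def __call__(self, vec):
        # Accumulate signed coordinates into their hashed buckets.
        out = torch.zeros(self.k, dtype=vec.dtype)
        out.index_add_(0, self.bucket, self.sign.to(vec.dtype) * vec)
        return out
```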
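Finally, a sketch of Boltzmann sampling as the alternative to greedy top-n selection; the temperature value and sampling without replacement are our assumptions.

```python
import torch

def boltzmann_select(utilities, n_select, temperature=1.0):
    # Sample in proportion to exp(u / T) instead of taking the top-n.
    # Lower T approaches greedy selection; higher T approaches uniform,
    # so T trades a little utility for batch diversity and keeps the
    # selection from collapsing onto a few near-duplicate documents.
    probs = torch.softmax(utilities / temperature, dim=0)
    return torch.multinomial(probs, n_select, replacement=False)
```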

Mechanical Dreams, by Mechanical Dirk