Mechanical Dreams

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence



In this episode:
• The Multi-Million Dollar NaN: Linda introduces the paper 'An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence' by Zhang et al. and sets the stage with the high cost of a failed pretraining run. Professor Norris is skeptical that simple 'bad data' is the root cause of complex divergences.
• The Toxic Five Tokens: The hosts discuss the paper's methodology of injecting synthetic uniform random noise into the training data. Linda reveals the counter-intuitive finding that noise drawn from a restricted vocabulary (such as repeated hash codes) is significantly more destabilizing than noise drawn from the full vocabulary.
• Bigger, Deeper, More Fragile: A look at the scaling laws of failure. Contrary to the hope that scale fixes everything, Linda explains the paper's finding that deeper models are substantially more likely to diverge when exposed to noise than their wider or smaller counterparts.
• CSI: Gradient Descent: Professor Norris and Linda dive into the forensic diagnostics, distinguishing failures caused by high learning rates from failures caused by noisy data. They discuss the specific 'smoking gun': maximum attention logits top out at around 1800 in noise-induced failures, versus around 4000 in learning-rate failures.
• MoE Stability and The QK-Fix: The hosts address the concern that Mixture-of-Experts (MoE) models might be hypersensitive to noise, a concern the paper's results do not support, and discuss QK-LayerNorm as the architectural 'safety belt' when perfect data cleaning isn't possible.
• Closing Thoughts: Final takeaways on the necessity of data curation and a witty sign-off from Professor Norris regarding the cleanliness of his own reading glasses versus the training data.
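
For listeners who want a feel for the two diagnostics mentioned above (the maximum attention logit and QK-LayerNorm), here is a minimal pure-Python sketch. It is not the paper's code. The function names, the 64-dimensional head size, and the artificially inflated activations are all assumptions made for illustration, and the layer norm here has no learned gain or bias.

```python
import math
import random

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def max_attention_logit(queries, keys, qk_layernorm=False):
    """Return the largest absolute scaled dot-product logit over all query/key pairs."""
    d = len(queries[0])
    if qk_layernorm:
        # QK-LayerNorm: normalize queries and keys before taking dot products.
        queries = [layer_norm(q) for q in queries]
        keys = [layer_norm(k) for k in keys]
    return max(
        abs(sum(a * b for a, b in zip(q, k))) / math.sqrt(d)
        for q in queries
        for k in keys
    )

rng = random.Random(0)
d = 64  # hypothetical head dimension
# Hypothetical inflated activations standing in for a destabilized run.
queries = [[rng.gauss(0, 50) for _ in range(d)] for _ in range(8)]
keys = [[rng.gauss(0, 50) for _ in range(d)] for _ in range(8)]

raw = max_attention_logit(queries, keys)
normed = max_attention_logit(queries, keys, qk_layernorm=True)
print(f"max |logit| without QK-LayerNorm: {raw:.1f}")
print(f"max |logit| with QK-LayerNorm:    {normed:.1f}")
```

In this simplified setting, each normalized vector has length at most sqrt(d), so the scaled logit cannot exceed sqrt(d) in magnitude no matter how large the raw activations grow. A real implementation with learned gain parameters would not have exactly this bound.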

Mechanical Dreams, by Mechanical Dirk