Mechanical Dreams

Lost in Backpropagation: The LM Head is a Gradient Bottleneck

In this episode:
• Chapter 1: Introduction to the Bottleneck: Linda introduces the paper and the general concept of the LM head. Professor Norris expresses initial skepticism about revisiting the softmax bottleneck.
• Chapter 2: Expressivity vs. Optimization: The hosts discuss how the paper shifts the focus from the classical expressivity limitation to a fundamental optimization problem.
• Chapter 3: The Math of Gradient Destruction: Linda breaks down the matrix math, explaining how backpropagating a V-dimensional logit gradient through a rank-D LM head destroys up to 99 percent of the gradient norm (see the sketch after this list).
• Chapter 4: SpamLang and Real-World Evidence: The discussion moves to the SpamLang synthetic task and 2B-parameter pretraining experiments, showing that the gradient bottleneck severely limits training speed and model capacity.
• Chapter 5: Implications for Scaling Laws: Norris and Linda wrap up by discussing what this means for the future of LLM pretraining and potential architectural fixes.
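For listeners who want to see the Chapter 3 effect concretely, here is a minimal numpy sketch. It is an illustration under assumptions, not code from the paper: the values V = 32000 and D = 256 and the random rank-D projection are hypothetical choices. It shows that pulling a generic V-dimensional logit gradient back through a rank-D head keeps only about D/V of its squared norm:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 32000, 256  # illustrative vocabulary size and hidden dimension (assumptions, not the paper's values)

# The LM head W (V x D) maps the D-dim hidden state to V-dim logits.
# In the backward pass, the V-dim logit gradient g is pulled back as W.T @ g,
# so only the component of g inside the rank-D column space of W survives.
W, _ = np.linalg.qr(rng.standard_normal((V, D)))  # orthonormal columns spanning a rank-D subspace

g = rng.standard_normal(V)      # a generic V-dimensional logit gradient
g_kept = W @ (W.T @ g)          # component of g that survives the rank-D pullback

frac = np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2
print(f"squared gradient norm kept: {frac:.4f} (expected D/V = {D / V:.4f})")
print(f"destroyed: {1 - frac:.2%}")
```

With these illustrative values, roughly 99 percent of the squared gradient norm is lost in the pullback, which matches the order of magnitude the hosts discuss in the episode.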

Mechanical Dreams, by Mechanical Dirk