In this episode:
• Chapter 1: Introduction to the Bottleneck: Linda introduces the paper and the general concept of the LM head. Professor Norris expresses initial skepticism about revisiting the softmax bottleneck.
• Chapter 2: Expressivity vs. Optimization: The hosts discuss how the paper shifts the focus from the classical expressivity limitation to a fundamental optimization problem.
• Chapter 3: The Math of Gradient Destruction: Linda breaks down the matrix math, explaining how backpropagating a V-dimensional gradient through a rank-D layer destroys up to 99 percent of the gradient norm (a numerical sketch follows this list).
• Chapter 4: SpamLang and Real-World Evidence: The discussion moves to the SpamLang synthetic task and 2B-parameter pretraining experiments, which show that the gradient bottleneck severely limits training speed and capacity.
• Chapter 5: Implications for Scaling Laws: Norris and Linda wrap up by discussing what this means for the future of LLM pretraining and potential architectural fixes.
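As a companion to Chapter 3, here is a minimal numpy sketch (ours, not taken from the paper or the episode) of why a rank-D head loses most of a random V-dimensional gradient: only the component of the logit gradient lying in the head's D-dimensional column space survives backpropagation, which for a random gradient is roughly a D/V fraction of its squared norm. The dimensions below are illustrative, not the paper's.

```python
import numpy as np

# Illustrative sizes (not from the paper): hidden width D, vocabulary size V.
D, V = 256, 32_768

rng = np.random.default_rng(0)

# LM head weight: maps a D-dimensional hidden state to V vocabulary logits.
W = rng.standard_normal((V, D)) / np.sqrt(D)

# A random V-dimensional gradient arriving at the logits during backprop.
g = rng.standard_normal(V)

# Backprop through the head preserves only the part of g lying in the
# D-dimensional column space of W; everything orthogonal to it is lost.
Q, _ = np.linalg.qr(W)        # orthonormal basis for W's column space (V x D)
g_kept = Q @ (Q.T @ g)        # surviving component of the logit gradient

kept_fraction = np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2
print(f"squared-norm fraction retained: {kept_fraction:.4f}")  # ~ D/V ≈ 0.008
```

With D/V around 0.8 percent in this toy setting, over 99 percent of the squared gradient norm never reaches the hidden state, which is the "gradient destruction" the hosts walk through in Chapter 3.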