In this episode:
• Chapter 1: Introduction to the Bottleneck: Linda introduces the paper and the general concept of the LM head. Professor Norris expresses initial skepticism about revisiting the softmax bottleneck.
• Chapter 2: Expressivity vs. Optimization: The hosts discuss how the paper shifts the focus from the classical expressivity limitation to a fundamental optimization problem.
• Chapter 3: The Math of Gradient Destruction: Linda breaks down the matrix math, explaining how backpropagating a V-dimensional gradient through a rank-D layer destroys up to 99 percent of the gradient norm (a numerical sketch follows this list).
• Chapter 4: SpamLang and Real-World Evidence: The discussion moves to the SpamLang synthetic task and 2B-parameter pretraining experiments, which show that the gradient bottleneck severely limits training speed and capacity.
• Chapter 5: Implications for Scaling Laws: Norris and Linda wrap up by discussing what this means for the future of LLM pretraining and potential architectural fixes.
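As a companion to Chapter 3, here is a minimal numpy sketch (ours, not taken from the paper or the episode) of why a rank-D head loses most of a random V-dimensional gradient: only the component of the logit gradient lying in the head's D-dimensional column space survives backpropagation, which for a random gradient is roughly a D/V fraction of its squared norm. The dimensions below are illustrative, not the paper's.

```python
import numpy as np

# Illustrative sizes (not from the paper): hidden width D, vocabulary size V.
D, V = 256, 32_768

rng = np.random.default_rng(0)

# LM head weight: maps a D-dimensional hidden state to V vocabulary logits.
W = rng.standard_normal((V, D)) / np.sqrt(D)

# A random V-dimensional gradient arriving at the logits during backprop.
g = rng.standard_normal(V)

# Backprop through the head preserves only the part of g lying in the
# D-dimensional column space of W; everything orthogonal to it is lost.
Q, _ = np.linalg.qr(W)        # orthonormal basis for W's column space (V x D)
g_kept = Q @ (Q.T @ g)        # surviving component of the logit gradient

kept_fraction = np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2
print(f"squared-norm fraction retained: {kept_fraction:.4f}")  # ~ D/V ≈ 0.008
```

With D/V around 0.8 percent in this toy setting, over 99 percent of the squared gradient norm never reaches the hidden state, which is the "gradient destruction" the hosts walk through in Chapter 3.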