In this episode:
• Introduction: The Transformer's Kryptonite: Professor Norris jokes that Transformers have solved everything, but Linda introduces a new paper challenging their ability to perform basic state tracking efficiently. They set the stage by distinguishing the well-known Out-of-Distribution failures from the paper's focus on In-Distribution data efficiency.
• The Setup: Modulo Arithmetic and Supervision Regimes: Linda explains the experimental setup using modular addition and permutation composition, and defines the three supervision formats: Outcome Supervision, Chain-of-Thought (CoT), and Aligned CoT. Norris questions why simple math requires such complex architectures, leading to a discussion on sample efficiency.
• The Showdown: Transformers vs. RNNs: The hosts discuss the surprising results where recurrent models (LSTMs and Dense-SSMs) crush Transformers in outcome supervision. They analyze why Transformers rely heavily on Chain-of-Thought to function, whereas RNNs struggle with standard CoT due to recall bottlenecks but excel with Aligned CoT.
• The Core Theory: Inductive Bias and the Sharing Factor: Linda dives into the concept of the "Sharing Factor" (kappa), explaining that RNNs have an inductive bias to share weights across sequence lengths, effectively learning the underlying algorithm. Norris is fascinated by the finding that Transformers exhibit "length isolation," essentially relearning the task from scratch for every new sequence length.
• Conclusion: Brute Force vs. True Learning: The pair wraps up by discussing the implications for Large Language Models, specifically regarding "context rot" and the massive data requirements for agentic workflows. Norris concedes that perhaps we haven't solved state tracking just yet, and they sign off.
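
For listeners who want a concrete picture of the tasks the hosts discuss, here is a minimal sketch of the two state-tracking problems in their standard formulations (the paper's exact tokenization and setup may differ):

```python
# Minimal sketch of the two state-tracking tasks from the episode.
# These are the textbook formulations of modular addition and
# permutation composition; the paper's exact encoding may differ.

def modular_addition(tokens, modulus):
    """Running sum of the input tokens, reduced mod `modulus`.

    The 'state' is a single residue, updated once per token.
    """
    state = 0
    for t in tokens:
        state = (state + t) % modulus
    return state

def compose_permutations(perms):
    """Left-to-right composition of permutations of {0, ..., n-1}.

    Each permutation is a tuple p where p[i] is the image of i.
    The 'state' is the permutation composed so far.
    """
    n = len(perms[0])
    state = tuple(range(n))  # identity permutation
    for p in perms:
        # Apply the next permutation on top of the accumulated state.
        state = tuple(p[state[i]] for i in range(n))
    return state

# Outcome supervision would expose only the final state; CoT supervision
# would also expose the intermediate state after every step.
print(modular_addition([3, 4, 6], 5))                # (3+4+6) mod 5 = 3
print(compose_permutations([(1, 0, 2), (0, 2, 1)]))  # (2, 0, 1)
```

Both tasks force a model to carry a compact state across the whole sequence, which is exactly what makes them a clean probe of recurrent versus attention-based architectures.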