Hal Turing and Dr. Ada Shannon open by situating the Dao-Gu paper — "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" (arXiv 2405.21060, ICML 2024) — within the bifurcated landscape of sequence modeling. For years, Transformer researchers and SSM researchers developed in parallel, unable to borrow optimizations across the divide. The episode traces the lineage from HiPPO and S4 through Mamba's selective state spaces, explaining why SSMs' linear-time asymptotic advantage never translated into wall-clock wins: the GPU ecosystem was built around dense matrix multiplication, and SSMs lacked the kind of hardware-aware tooling that FlashAttention brought to attention. The hosts also credit the direct intellectual ancestor — Katharopoulos et al.'s 2020 "Transformers are RNNs" — which showed that replacing the softmax with a kernel feature map turns attention into a linear recurrence, establishing the conceptual template Dao and Gu would later formalize.
The core of the episode is a careful unpacking of structured semiseparable matrices, a class of objects from numerical linear algebra — Kalman filter theory, PDE solvers — that the ML community had never encountered until Dao and Gu made the connection. Below the diagonal, every entry of a causal SSM's input-output matrix has the form M[i,j] = C[i]ᵀ A[i]⋯A[j+1] B[j], which is precisely the generator representation of an N-semiseparable matrix, where N is the state dimension. Shannon walks through the O(n) factored form in its simplest rank-1 case — M[i,j] equals u[i] times v[j] below the diagonal — and explains how this structure encodes both the SSM recurrence and the masked attention computation as two views of the same algebraic object. The canonical reference is Vandebril, Van Barel, and Mastronardi's two-volume work (Johns Hopkins University Press, 2008). Once the connection is made, hardware-efficient algorithms from one domain port directly to the other.
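The duality Shannon describes can be checked numerically. A minimal sketch, assuming the scalar-transition special case (a single scalar a_t per step, as in Mamba-2's scalar-identity A); all shapes and variable names here are illustrative, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state dimension

a = rng.uniform(0.5, 1.0, T)     # per-step scalar transitions a_t
B = rng.standard_normal((T, N))  # input projections b_t
C = rng.standard_normal((T, N))  # output projections c_t
x = rng.standard_normal(T)       # scalar input sequence

# Linear-time view: run the recurrence h_t = a_t h_{t-1} + b_t x_t, y_t = c_t . h_t
h = np.zeros(N)
y_rec = np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Quadratic-time view: materialize the semiseparable matrix
# M[t, s] = (a_{s+1} ... a_t) * (c_t . b_s) for s <= t, zero above the diagonal
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = np.prod(a[s + 1 : t + 1]) * (C[t] @ B[s])
y_mat = M @ x

print(np.allclose(y_rec, y_mat))  # → True: the two views agree
```

The recurrence never builds M, yet produces the same outputs as the attention-style matrix multiply — the "two views of the same algebraic object" in miniature.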
But the episode frames this mathematical achievement against a harder question raised by subsequent theoretical work: the L²M Condition of Chen et al. and the bipartite mutual information scaling law. The duality shows that SSM computation and attention computation are equivalent representations — but equivalence of computation does not imply equivalence of information retention. SSMs compress sequence history into a fixed-size state regardless of context length; the Transformer's KV-cache grows linearly, retaining more as context expands. The mutual information scaling law formalizes this gap: capturing the multi-token dependencies present in natural language requires a history state that grows with context length. The episode closes on what this implies for hybrid architectures — systems that combine SSM efficiency with selective attention — and whether the theoretical unification Dao and Gu achieved changes how practitioners should think about where compressed state fails.
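The fixed-state-versus-growing-cache gap can be made concrete with a back-of-envelope memory count. A sketch with hypothetical dimensions (head size, state size, and function names are all illustrative assumptions, not numbers from the episode):

```python
# Elements held in memory at generation step T, for one head of dimension d.
def kv_cache_elems(T: int, d: int) -> int:
    # Transformer: one key and one value vector cached per past token
    return T * 2 * d

def ssm_state_elems(d: int, n: int = 128) -> int:
    # SSM: a fixed n x d state matrix, independent of context length T
    return n * d

d = 64
for T in (1_000, 10_000, 100_000):
    print(T, kv_cache_elems(T, d), ssm_state_elems(d))
```

The cache grows linearly with T while the SSM state stays constant — which is exactly where the mutual information scaling law bites: a constant-size state cannot keep pace with dependencies whose information content grows with context.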
Sources:
1. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
http://arxiv.org/abs/2405.21060
2. L2M: Mutual Information Scaling Law for Long-Context Language Modeling — Chen et al., 2025
https://arxiv.org/pdf/2503.04725
3. On a New Class of Structured Matrices — Yuli Eidelman, Israel Gohberg, 1997
https://scholar.google.com/scholar?q=On+a+New+Class+of+Structured+Matrices
4. Time-Varying Systems and Computations — Patrick Dewilde, Alle-Jan van der Veen, 1998
https://scholar.google.com/scholar?q=Time-Varying+Systems+and+Computations
5. Matrix Computations and Semiseparable Matrices (Volumes 1 & 2) — Raf Vandebril, Marc Van Barel, Nicola Mastronardi, 2008
https://scholar.google.com/scholar?q=Matrix+Computations+with+Semiseparable+Matrices+(Volumes+1+&+2)
6. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention — Katharopoulos, Vyas, Pappas, Fleuret, 2020
https://scholar.google.com/scholar?q=Transformers+are+RNNs:+Fast+Autoregressive+Transformers+with+Linear+Attention
7. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Gu, Dao, 2023
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
8. Semiseparable Matrices in Linear Algebra, Science and Engineering (2 vols.) — Vandebril, Van Barel, Mastronardi, 2008
https://scholar.google.com/scholar?q=Semiseparable+Matrices+in+Linear+Algebra,+Science+and+Engineering+(2+vols.)
9. Retentive Network: A Successor to Transformer for Large Language Models — Sun et al., 2023
https://scholar.google.com/scholar?q=Retentive+Network:+A+Successor+to+Transformer+for+Large+Language+Models
10. RWKV: Reinventing RNNs for the Transformer Era — Peng et al., 2023
https://scholar.google.com/scholar?q=RWKV:+Reinventing+RNNs+for+the+Transformer+Era
11. Gated Linear Attention Transformers with Hardware-Efficient Training — Yang et al., 2023
https://scholar.google.com/scholar?q=Gated+Linear+Attention+Transformers+with+Hardware-Efficient+Training
12. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Dao, 2024
https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
13. Zoology: Measuring and Improving Recall in Efficient Language Models — Arora et al., 2024
https://scholar.google.com/scholar?q=Zoology:+Measuring+and+Improving+Recall+in+Efficient+Language+Models
14. Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Overcoming+Long-Context+Limitations+of+State-Space+Models+via+Context-Dependent+Sparse+Attention
15. Bayesian Optimality of In-Context Learning with Selective State Spaces — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Bayesian+Optimality+of+In-Context+Learning+with+Selective+State+Spaces
16. On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=On+the+Expressiveness+of+Softmax+Attention:+A+Recurrent+Neural+Network+Perspective
17. Bridging the Divide: Reconsidering Softmax and Linear Attention — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Bridging+the+Divide:+Reconsidering+Softmax+and+Linear+Attention
18. Softmax Linear Attention: Reclaiming Global Competition — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Softmax+Linear+Attention:+Reclaiming+Global+Competition
19. Hybrid Architectures for Language Models: Systematic Analysis and Design Insights — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Hybrid+Architectures+for+Language+Models:+Systematic+Analysis+and+Design+Insights
20. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
21. AI Post Transformers: FlashAttention-4 Conquers Asymmetric GPU Hardware Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-06-flashattention-4-conquers-asymmetric-gpu-78839b.mp3
22. AI Post Transformers: ATTENTION2D and lean attention: Distributed Self-Attention — Hal Turing & Dr. Ada Shannon, 2025
https://podcasters.spotify.com/pod/show/12146088098/episodes/ATTENTION2D-and-lean-attention-Distributed-Self-Attention-e3a7r4n