Hal Turing and Dr. Ada Shannon open by situating the Dao-Gu paper — "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" (arXiv 2405.21060, ICML 2024) — within the bifurcated landscape of sequence modeling. For years, Transformer researchers and SSM researchers developed in parallel, unable to borrow optimizations across the divide. The episode traces the lineage from HiPPO and S4 through Mamba's selective state spaces, explaining why SSMs' linear-time asymptotic advantage never translated into wall-clock wins: the GPU ecosystem was built around dense matrix multiplication, and SSMs lacked the kind of hardware-aware tooling that FlashAttention brought to attention. The hosts also credit the direct intellectual ancestor — Katharopoulos et al.'s 2020 "Transformers are RNNs" — which showed that replacing the softmax with a kernel feature map turns attention into a linear recurrence, establishing the conceptual template Dao and Gu would later formalize.
The core of the episode is a careful unpacking of structured semiseparable matrices, a class of objects from numerical linear algebra — Kalman filter theory, PDE solvers — that the ML community had never encountered until Dao and Gu made the connection. Below the diagonal, every entry of a causal SSM's input-output matrix has the form M[i,j] = C[i]ᵀ A[i]⋯A[j+1] B[j], which is precisely the generator representation of an N-semiseparable matrix, where N is the state dimension. Shannon walks through the O(n) factored form in its simplest rank-1 case — M[i,j] equals u[i] times v[j] below the diagonal — and explains how this structure encodes both the SSM recurrence and the masked attention computation as two views of the same algebraic object. The canonical reference is Vandebril, Van Barel, and Mastronardi's two-volume work (Johns Hopkins University Press, 2008). Once the connection is made, hardware-efficient algorithms from one domain port directly to the other.
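The duality Shannon describes can be checked numerically. A minimal sketch, assuming the scalar-transition special case (a single scalar a_t per step, as in Mamba-2's scalar-identity A); all shapes and variable names here are illustrative, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state dimension

a = rng.uniform(0.5, 1.0, T)     # per-step scalar transitions a_t
B = rng.standard_normal((T, N))  # input projections b_t
C = rng.standard_normal((T, N))  # output projections c_t
x = rng.standard_normal(T)       # scalar input sequence

# Linear-time view: run the recurrence h_t = a_t h_{t-1} + b_t x_t, y_t = c_t . h_t
h = np.zeros(N)
y_rec = np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Quadratic-time view: materialize the semiseparable matrix
# M[t, s] = (a_{s+1} ... a_t) * (c_t . b_s) for s <= t, zero above the diagonal
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = np.prod(a[s + 1 : t + 1]) * (C[t] @ B[s])
y_mat = M @ x

print(np.allclose(y_rec, y_mat))  # → True: the two views agree
```

The recurrence never builds M, yet produces the same outputs as the attention-style matrix multiply — the "two views of the same algebraic object" in miniature.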
But the episode frames this mathematical achievement against a harder question raised by subsequent theoretical work: the L²M Condition of Chen et al. and the bipartite mutual information scaling law. The duality shows that SSM computation and attention computation are equivalent representations — but equivalence of computation does not imply equivalence of information retention. SSMs compress sequence history into a fixed-size state regardless of context length; the Transformer's KV-cache grows linearly, retaining more as context expands. The mutual information scaling law formalizes this gap: capturing the multi-token dependencies present in natural language requires a history state that grows with context length. The episode closes on what this implies for hybrid architectures — systems that combine SSM efficiency with selective attention — and whether the theoretical unification Dao and Gu achieved changes how practitioners should think about where compressed state fails.
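The fixed-state-versus-growing-cache gap can be made concrete with a back-of-envelope memory count. A sketch with hypothetical dimensions (head size, state size, and function names are all illustrative assumptions, not numbers from the episode):

```python
# Elements held in memory at generation step T, for one head of dimension d.
def kv_cache_elems(T: int, d: int) -> int:
    # Transformer: one key and one value vector cached per past token
    return T * 2 * d

def ssm_state_elems(d: int, n: int = 128) -> int:
    # SSM: a fixed n x d state matrix, independent of context length T
    return n * d

d = 64
for T in (1_000, 10_000, 100_000):
    print(T, kv_cache_elems(T, d), ssm_state_elems(d))
```

The cache grows linearly with T while the SSM state stays constant — which is exactly where the mutual information scaling law bites: a constant-size state cannot keep pace with dependencies whose information content grows with context.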
Sources:
1. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
http://arxiv.org/abs/2405.21060
2. L2M: Mutual Information Scaling Law for Long-Context Language Modeling — Chen et al., 2025
https://arxiv.org/pdf/2503.04725
3. On a New Class of Structured Matrices — Yuli Eidelman, Israel Gohberg, 1997
https://scholar.google.com/scholar?q=On+a+New+Class+of+Structured+Matrices
4. Time-Varying Systems and Computations — Patrick Dewilde, Alle-Jan van der Veen, 1998
https://scholar.google.com/scholar?q=Time-Varying+Systems+and+Computations
5. Matrix Computations and Semiseparable Matrices (Volumes 1 & 2) — Raf Vandebril, Marc Van Barel, Nicola Mastronardi, 2008
https://scholar.google.com/scholar?q=Matrix+Computations+with+Semiseparable+Matrices+(Volumes+1+&+2)
6. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention — Katharopoulos, Vyas, Pappas, Fleuret, 2020
https://scholar.google.com/scholar?q=Transformers+are+RNNs:+Fast+Autoregressive+Transformers+with+Linear+Attention
7. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Gu, Dao, 2023
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
8. Semiseparable Matrices in Linear Algebra, Science and Engineering (2 vols.) — Vandebril, Van Barel, Mastronardi, 2008
https://scholar.google.com/scholar?q=Semiseparable+Matrices+in+Linear+Algebra,+Science+and+Engineering+(2+vols.)
9. Retentive Network: A Successor to Transformer for Large Language Models — Sun et al., 2023
https://scholar.google.com/scholar?q=Retentive+Network:+A+Successor+to+Transformer+for+Large+Language+Models
10. RWKV: Reinventing RNNs for the Transformer Era — Peng et al., 2023
https://scholar.google.com/scholar?q=RWKV:+Reinventing+RNNs+for+the+Transformer+Era
11. Gated Linear Attention Transformers with Hardware-Efficient Training — Yang et al., 2023
https://scholar.google.com/scholar?q=Gated+Linear+Attention+Transformers+with+Hardware-Efficient+Training
12. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Dao, 2024
https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
13. Zoology: Measuring and Improving Recall in Efficient Language Models — Arora et al., 2024
https://scholar.google.com/scholar?q=Zoology:+Measuring+and+Improving+Recall+in+Efficient+Language+Models
14. Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Overcoming+Long-Context+Limitations+of+State-Space+Models+via+Context-Dependent+Sparse+Attention
15. Bayesian Optimality of In-Context Learning with Selective State Spaces — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Bayesian+Optimality+of+In-Context+Learning+with+Selective+State+Spaces
16. On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=On+the+Expressiveness+of+Softmax+Attention:+A+Recurrent+Neural+Network+Perspective
17. Bridging the Divide: Reconsidering Softmax and Linear Attention — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Bridging+the+Divide:+Reconsidering+Softmax+and+Linear+Attention
18. Softmax Linear Attention: Reclaiming Global Competition — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Softmax+Linear+Attention:+Reclaiming+Global+Competition
19. Hybrid Architectures for Language Models: Systematic Analysis and Design Insights — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Hybrid+Architectures+for+Language+Models:+Systematic+Analysis+and+Design+Insights
20. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
21. AI Post Transformers: FlashAttention-4 Conquers Asymmetric GPU Hardware Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-06-flashattention-4-conquers-asymmetric-gpu-78839b.mp3
22. AI Post Transformers: ATTENTION2D and lean attention: Distributed Self-Attention — Hal Turing & Dr. Ada Shannon, 2025
https://podcasters.spotify.com/pod/show/12146088098/episodes/ATTENTION2D-and-lean-attention-Distributed-Self-Attention-e3a7r4n