AI Post Transformers

Mamba-3 for Efficient Sequence Modeling


This episode explores Mamba-3, a new state space sequence model whose authors argue that architectures should be judged not just by perplexity, but by deployment realities such as decode latency, throughput, and hardware efficiency. It explains how Mamba-3 revisits earlier Mamba-style models with three main changes (a new exponential-trapezoidal discretization, complex-valued state dynamics, and a multi-input, multi-output (MIMO) structure) aimed at improving the quality-efficiency tradeoff for long-sequence inference. The discussion also situates the work against transformers, whose KV-cache costs grow with context length, and against competing linear-recurrence approaches such as DeltaNet and emerging hybrid industry systems. The episode highlights a broader shift in machine learning: the future of sequence models may be decided less by benchmark curves alone and more by how well they actually run in production.
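To make the discretization point concrete, here is a toy sketch in plain NumPy of a diagonal linear state space model. It contrasts the zero-order-hold update used by earlier Mamba-style models with a trapezoidal-rule update, and shows why decoding with a fixed-size recurrent state keeps per-token memory constant, unlike a KV cache that grows with context. Variable names and constants here are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch (not Mamba-3's exact formulation): per-channel discretization
# of the diagonal linear SSM  x'(t) = a * x(t) + b * u(t).
import numpy as np

def zoh_step(x, u, a, b, dt):
    """Zero-order hold, Mamba-style simplification:
    x_t = exp(dt*a) * x_{t-1} + dt*b * u_t  (input held constant over the step)."""
    return np.exp(dt * a) * x + dt * b * u

def trapezoidal_step(x, u_prev, u, a, b, dt):
    """Trapezoidal rule: averages the input at both endpoints of the step.
    x_t = (1 + dt*a/2)/(1 - dt*a/2) * x_{t-1} + (dt*b/2)/(1 - dt*a/2) * (u_{t-1} + u_t)."""
    denom = 1.0 - 0.5 * dt * a
    return ((1.0 + 0.5 * dt * a) / denom) * x + (0.5 * dt * b / denom) * (u_prev + u)

# Decode loop: the recurrent state is a fixed-size vector, so per-token compute
# and memory stay constant with sequence length (no growing KV cache).
d_state = 16
a = -np.abs(np.random.randn(d_state))   # stable (negative) poles
b = np.random.randn(d_state)
x_zoh = np.zeros(d_state)
x_trap = np.zeros(d_state)
u_prev = 0.0
for u in np.random.randn(100):           # stream of scalar inputs
    dt = 0.1                              # in Mamba-style models the step size is input-dependent
    x_zoh = zoh_step(x_zoh, u, a, b, dt)
    x_trap = trapezoidal_step(x_trap, u_prev, u, a, b, dt)
    u_prev = u
```

The trapezoidal update uses both endpoints of the input over each step rather than holding it constant, which is the intuition behind the exponential-trapezoidal rule discussed in the episode; the exact Mamba-3 formulation is in the paper below.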
Sources:
1. Mamba-3: Improved Sequence Modeling using State Space Principles — Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, Albert Gu, 2026
http://arxiv.org/abs/2603.15569
2. Efficiently Modeling Long Sequences with Structured State Spaces — Albert Gu, Karan Goel, Christopher Re, 2021
https://scholar.google.com/scholar?q=Efficiently+Modeling+Long+Sequences+with+Structured+State+Spaces
3. On the Parameterization and Initialization of Diagonal State Space Models — Albert Gu, Ankit Gupta, Jonathan Berant, Tri Dao, Christopher Re, 2022
https://scholar.google.com/scholar?q=On+the+Parameterization+and+Initialization+of+Diagonal+State+Space+Models
4. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2024
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
5. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
6. Unitary Evolution Recurrent Neural Networks — Martin Arjovsky, Amar Shah, Yoshua Bengio, 2016
https://scholar.google.com/scholar?q=Unitary+Evolution+Recurrent+Neural+Networks
7. HiPPO: Recurrent Memory with Optimal Polynomial Projections — Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re, 2020
https://scholar.google.com/scholar?q=HiPPO:+Recurrent+Memory+with+Optimal+Polynomial+Projections
8. The DeltaNet Family: Efficient Sequence Modeling via State Tracking — Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber; later Gated DeltaNet variants, 2021 / 2025
https://scholar.google.com/scholar?q=The+DeltaNet+Family:+Efficient+Sequence+Modeling+via+State+Tracking
9. Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu, 2023
https://scholar.google.com/scholar?q=Rotary+Position+Embedding
10. On the Computational Limits of State Space Models and Their Ability to Track State — Ruggero Grazzi, Julien Siems, Arlind Zela, et al., 2025
https://scholar.google.com/scholar?q=On+the+Computational+Limits+of+State+Space+Models+and+Their+Ability+to+Track+State
11. State Space Models Fail at Simple State Tracking Tasks — Aviad Sarrof, Tom Veitsman, and Michael Hahn, 2024
https://scholar.google.com/scholar?q=State+Space+Models+Fail+at+Simple+State+Tracking+Tasks
12. Hungry Hungry Hippos: Towards Language Modeling with State Space Models — Daniel Y. Fu, Tri Dao, et al., 2023
https://scholar.google.com/scholar?q=Hungry+Hungry+Hippos:+Towards+Language+Modeling+with+State+Space+Models
13. Kimi Linear — Kimi Team, 2025
https://scholar.google.com/scholar?q=Kimi+Linear
14. Kvzip: Query-agnostic KV Cache Compression with Context Reconstruction, 2024/2025
https://scholar.google.com/scholar?q=Kvzip:+Query-agnostic+KV+Cache+Compression+with+Context+Reconstruction
15. KVLINK: Accelerating Large Language Models via Efficient KV Cache Reuse, 2024/2025
https://scholar.google.com/scholar?q=KVLINK:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
16. KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models, 2024/2025
https://scholar.google.com/scholar?q=KV-CAR:+KV+Cache+Compression+using+Autoencoders+and+KV+Reuse+in+Large+Language+Models
17. Repeat After Me: Transformers Are Better Than State Space Models at Copying — Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, 2024
https://scholar.google.com/scholar?q=Repeat+After+Me:+Transformers+Are+Better+Than+State+Space+Models+at+Copying
18. Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling, 2024/2025
https://scholar.google.com/scholar?q=Samba:+Simple+Hybrid+State+Space+Models+for+Efficient+Unlimited+Context+Language+Modeling
19. Maximally-Informative Retrieval for State Space Model Generation, 2024/2025
https://scholar.google.com/scholar?q=Maximally-Informative+Retrieval+for+State+Space+Model+Generation
20. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
21. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
22. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
23. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
24. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
25. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
Interactive Visualization: Mamba-3 for Efficient Sequence Modeling

AI Post Transformers, by mcgrof