AI Post Transformers

Mamba-3 for Efficient Sequence Modeling


This episode explores Mamba-3, a new state space sequence model whose authors argue that architectures should be judged not just by perplexity, but by deployment realities such as decode latency, throughput, and hardware efficiency. It explains how Mamba-3 revisits earlier Mamba-style models with three main changes (a new exponential-trapezoidal discretization, complex-valued state dynamics, and a multi-input, multi-output (MIMO) structure) aimed at improving the quality-efficiency tradeoff for long-sequence inference. The discussion also situates the work against transformers, whose KV-cache costs grow with context length, and against competing linear-recurrence approaches such as DeltaNet and emerging hybrid industry systems. The episode highlights a broader shift in machine learning: the future of sequence models may be decided less by benchmark curves alone and more by how well they actually run in production.
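To make the discretization point concrete, here is a toy sketch in plain NumPy of a diagonal linear state space model. It contrasts the zero-order-hold update used by earlier Mamba-style models with a trapezoidal-rule update, and shows why decoding with a fixed-size recurrent state keeps per-token memory constant, unlike a KV cache that grows with context. Variable names and constants here are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch (not Mamba-3's exact formulation): per-channel discretization
# of the diagonal linear SSM  x'(t) = a * x(t) + b * u(t).
import numpy as np

def zoh_step(x, u, a, b, dt):
    """Zero-order hold, Mamba-style simplification:
    x_t = exp(dt*a) * x_{t-1} + dt*b * u_t  (input held constant over the step)."""
    return np.exp(dt * a) * x + dt * b * u

def trapezoidal_step(x, u_prev, u, a, b, dt):
    """Trapezoidal rule: averages the input at both endpoints of the step.
    x_t = (1 + dt*a/2)/(1 - dt*a/2) * x_{t-1} + (dt*b/2)/(1 - dt*a/2) * (u_{t-1} + u_t)."""
    denom = 1.0 - 0.5 * dt * a
    return ((1.0 + 0.5 * dt * a) / denom) * x + (0.5 * dt * b / denom) * (u_prev + u)

# Decode loop: the recurrent state is a fixed-size vector, so per-token compute
# and memory stay constant with sequence length (no growing KV cache).
d_state = 16
a = -np.abs(np.random.randn(d_state))   # stable (negative) poles
b = np.random.randn(d_state)
x_zoh = np.zeros(d_state)
x_trap = np.zeros(d_state)
u_prev = 0.0
for u in np.random.randn(100):           # stream of scalar inputs
    dt = 0.1                              # in Mamba-style models the step size is input-dependent
    x_zoh = zoh_step(x_zoh, u, a, b, dt)
    x_trap = trapezoidal_step(x_trap, u_prev, u, a, b, dt)
    u_prev = u
```

The trapezoidal update uses both endpoints of the input over each step rather than holding it constant, which is the intuition behind the exponential-trapezoidal rule discussed in the episode; the exact Mamba-3 formulation is in the paper below.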
Sources:
1. Mamba-3: Improved Sequence Modeling using State Space Principles — Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, Albert Gu, 2026
http://arxiv.org/abs/2603.15569
2. Efficiently Modeling Long Sequences with Structured State Spaces — Albert Gu, Karan Goel, Christopher Re, 2021
https://scholar.google.com/scholar?q=Efficiently+Modeling+Long+Sequences+with+Structured+State+Spaces
3. On the Parameterization and Initialization of Diagonal State Space Models — Albert Gu, Ankit Gupta, Jonathan Berant, Tri Dao, Christopher Re, 2022
https://scholar.google.com/scholar?q=On+the+Parameterization+and+Initialization+of+Diagonal+State+Space+Models
4. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2024
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
5. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
6. Unitary Evolution Recurrent Neural Networks — Martin Arjovsky, Amar Shah, Yoshua Bengio, 2016
https://scholar.google.com/scholar?q=Unitary+Evolution+Recurrent+Neural+Networks
7. HiPPO: Recurrent Memory with Optimal Polynomial Projections — Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re, 2020
https://scholar.google.com/scholar?q=HiPPO:+Recurrent+Memory+with+Optimal+Polynomial+Projections
8. The DeltaNet Family: Efficient Sequence Modeling via State Tracking — Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber; later Gated DeltaNet variants, 2021 / 2025
https://scholar.google.com/scholar?q=The+DeltaNet+Family:+Efficient+Sequence+Modeling+via+State+Tracking
9. Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu, 2023
https://scholar.google.com/scholar?q=Rotary+Position+Embedding
10. On the Computational Limits of State Space Models and Their Ability to Track State — Ruggero Grazzi, Julien Siems, Arlind Zela, et al., 2025
https://scholar.google.com/scholar?q=On+the+Computational+Limits+of+State+Space+Models+and+Their+Ability+to+Track+State
11. State Space Models Fail at Simple State Tracking Tasks — Aviad Sarrof, Tom Veitsman, and Michael Hahn, 2024
https://scholar.google.com/scholar?q=State+Space+Models+Fail+at+Simple+State+Tracking+Tasks
12. Hungry Hungry Hippos: Towards Language Modeling with State Space Models — Daniel Y. Fu, Tri Dao, et al., 2023
https://scholar.google.com/scholar?q=Hungry+Hungry+Hippos:+Towards+Language+Modeling+with+State+Space+Models
13. Kimi Linear — Kimi Team, 2025
https://scholar.google.com/scholar?q=Kimi+Linear
14. Kvzip: Query-agnostic KV Cache Compression with Context Reconstruction, 2024/2025
https://scholar.google.com/scholar?q=Kvzip:+Query-agnostic+KV+Cache+Compression+with+Context+Reconstruction
15. KVLINK: Accelerating Large Language Models via Efficient KV Cache Reuse, 2024/2025
https://scholar.google.com/scholar?q=KVLINK:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
16. KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models, 2024/2025
https://scholar.google.com/scholar?q=KV-CAR:+KV+Cache+Compression+using+Autoencoders+and+KV+Reuse+in+Large+Language+Models
17. Repeat After Me: Transformers Are Better Than State Space Models at Copying — Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, 2024
https://scholar.google.com/scholar?q=Repeat+After+Me:+Transformers+Are+Better+Than+State+Space+Models+at+Copying
18. Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling, 2024/2025
https://scholar.google.com/scholar?q=Samba:+Simple+Hybrid+State+Space+Models+for+Efficient+Unlimited+Context+Language+Modeling
19. Maximally-Informative Retrieval for State Space Model Generation, 2024/2025
https://scholar.google.com/scholar?q=Maximally-Informative+Retrieval+for+State+Space+Model+Generation
20. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
21. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
22. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
23. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
24. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
25. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
Interactive Visualization: Mamba-3 for Efficient Sequence Modeling

AI Post Transformers, by mcgrof