April 01, 2026

EP139: Mamba-3 Fixes the Transformer Memory Bottleneck

20 minutes

The paper "Mamba-3: Improved Sequence Modeling using State Space Principles" introduces an advanced state space model (SSM) designed to push the performance-efficiency Pareto frontier for Large Language Models (LLMs). Guided by an inference-first perspective, the authors address the quality and hardware-efficiency limitations of prior sub-quadratic models through three core methodological innovations:

Exponential-Trapezoidal Discretization: A more expressive recurrence derived from SSM discretization that provides a second-order accurate approximation of the state-input integral. This method induces an implicit data-dependent convolution, which empirically allows the model to function effectively without the external short causal convolutions typical in other architectures.
Complex-valued State Space Models: To overcome the inability of real-valued SSMs to solve certain state-tracking tasks (like parity), Mamba-3 utilizes complex-valued state updates. This is implemented efficiently using a "RoPE trick" that applies data-dependent rotary embeddings to the model's projections.
Multi-Input, Multi-Output (MIMO) Formulation: This refinement shifts from outer-product-based updates to matrix-multiplication-based updates, increasing arithmetic intensity and hardware utilization during decoding. It allows for increased model FLOPs and expressivity without significantly increasing decode latency.
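
To make the parity point concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a single complex-valued state whose transition is a data-dependent rotation on the unit circle; the function name parity_via_rotation is hypothetical and chosen only for this illustration. Mamba-3 itself realizes complex updates through the "RoPE trick" on its projections, which this sketch does not attempt to reproduce.

```python
import cmath


def parity_via_rotation(bits):
    """Toy illustration: track parity with one complex state.

    Recurrence (an assumption for this sketch): h_t = exp(i * pi * x_t) * h_{t-1}.
    """
    h = 1 + 0j  # start at +1 on the unit circle
    for x in bits:
        # Data-dependent rotation: angle pi when x == 1, angle 0 when x == 0.
        h = cmath.exp(1j * cmath.pi * x) * h
    # h sits near +1 after an even number of ones and near -1 after an odd number.
    return 0 if h.real > 0 else 1


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 1, 1]
    # Compare the recurrent readout with the directly computed parity.
    print(parity_via_rotation(bits), sum(bits) % 2)
```

Each 1 in the input rotates the state by half a turn, so the sign of the real part encodes whether the count of ones so far is even or odd. The two printed values should agree for any bit sequence.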

Empirically, Mamba-3 demonstrates significant gains across language modeling, retrieval, and state-tracking tasks. At the 1.5B scale, its MIMO variant improves average downstream accuracy by 1.8 percentage points over the next best model (Gated DeltaNet). Furthermore, Mamba-3 achieves comparable perplexity to its predecessor, Mamba-2, while using half the state size, resulting in a faster and more efficient model.

By Yun Wu
