Echo: The Paper Arguing You Never Needed a KV Cache for Retrieval
Source: Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Paper was published on May 07, 2026
This episode was AI-generated on May 11, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
A pure Mamba-2 scores 3% on the canonical associative recall benchmark. Echo scores 100% — using a fixed-size state about five thousand times smaller than an equivalent KV cache. The argument isn't that attention got better; it's that retrieval was a regression problem all along, and the KV cache is an artifact of solving it the hard way.
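The "five thousand times smaller" figure can be sanity-checked from the numbers quoted in these notes (about 77 KB of state versus about 384 MB of KV cache per layer at 131k tokens). The arithmetic below is only a sketch. The model width of 768 and the 2-byte (fp16) element size are assumptions that are not stated in these notes; they are chosen because they happen to reproduce the 384 MB figure. Reading "131k" as 2**17 and "KB" as KiB is also an assumption.

```python
def kv_cache_bytes(n_tokens, d_model, bytes_per_elem=2):
    """Bytes of KV cache for one layer: one key and one value vector per token."""
    return 2 * n_tokens * d_model * bytes_per_elem


N_TOKENS = 131_072             # "131k tokens", read as 2**17 (assumption)
D_MODEL = 768                  # assumed model width, not given in the notes
ECHO_STATE_BYTES = 77 * 1024   # "~77 KB" from the notes, read as KiB (assumption)

kv_bytes = kv_cache_bytes(N_TOKENS, D_MODEL)
ratio = kv_bytes / ECHO_STATE_BYTES

print(f"KV cache per layer: {kv_bytes / 2**20:.0f} MiB")
print(f"KV cache / fixed state: about {ratio:.0f}x")
```

The point of the comparison is the scaling: the KV cache term grows linearly with `n_tokens`, while the fixed-size state does not depend on sequence length at all.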
Key Takeaways
Why retrieval can be reframed as ridge regression solvable from running sufficient statistics, making the KV cache an implementation choice rather than a necessity
How Echo's Spectral Koopman Attention uses a lag-one covariance and eigenvalue filter to suppress one-off distractors, a selectivity mechanism standard attention can't express
The concrete memory comparison: ~77 KB of state for Echo versus ~384 MB per layer of KV cache at 131k tokens
Why this method gets more accurate with longer sequences, inverting the state-space 'memory cliff'
Where the headline result is most fragile: scale is capped at 180M parameters, benchmarks lean on synthetic retrieval tasks like MQAR, and ablations don't cleanly separate the closed-form solve from the spectral filter
Why the wall-clock speedup hasn't landed yet even though the memory win has
00:00 - The memory cliff and the three-percent floor
Why state-space models collapse to chance on associative recall regardless of scale, and how hybrids only shrink the problem rather than solve it.
03:59 - Retrieval as regression, not attention
The conceptual move at the heart of the paper: trained attention converges to ridge regression, and ridge regression has a closed-form solution computable from constant-size running totals.
07:59 - Inside Spectral Koopman Attention
The three accumulators Echo maintains per layer, and how a lag-one covariance lets you fit a Koopman operator whose eigenvalues filter persistent bindings from transient noise.
11:58 - The headline numbers
100% on MQAR versus 3% for Mamba-2, length generalization to 64× the training horizon, and a ~5000× memory reduction at long context.
15:58 - Steelmanning the skeptics
Scale caps at 180M parameters, benchmarks are heavily synthetic, ablations don't isolate the spectral filter's contribution, and the speed advantage is still gated on kernel work.
19:57 - Why the framing matters more than the benchmark
What changes if 'retrieval is regression' holds at scale: for agentic workloads, long-context deployment, and the design space of future architectures.
Recommended Reading
Zoology: Measuring and Improving Recall in Efficient Language Models - The paper that introduced the MQAR benchmark central to this episode and crystallized the 'state-space models can't do associative recall' problem Echo is trying to solve.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces - The state-space architecture whose 3% MQAR score is the foil for Echo's 100%, and the baseline whose memory cliff motivates the whole paper.
Transformers Learn In-Context by Gradient Descent - Background for the episode's key reframing, that trained attention implements a classical regression estimator, which is the conceptual move Echo exploits to replace it.
Jamba: A Hybrid Transformer-Mamba Language Model - A production example of the hybrid approach Echo argues against: keeping some attention layers (and their KV cache) to patch state-space recall failures.
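The "retrieval as regression" idea from the chapter notes above can be illustrated with a small NumPy sketch: key-value pairs are written into two constant-size running totals, and a query is answered with the closed-form ridge solution. This is a generic textbook construction under stated assumptions, not the paper's implementation. The class name `RunningRidgeMemory`, the choice of two accumulators, the regularizer `lam`, and the random-vector toy task are all hypothetical, and the sketch omits the lag-one covariance and the eigenvalue filter entirely.

```python
import numpy as np


class RunningRidgeMemory:
    """Hypothetical fixed-size memory: ridge regression from running totals."""

    def __init__(self, d_key, d_val, lam=1e-6):
        self.S_kk = np.zeros((d_key, d_key))  # running sum of k k^T
        self.S_kv = np.zeros((d_key, d_val))  # running sum of k v^T
        self.lam = lam                        # ridge regularizer (assumed value)

    def write(self, k, v):
        # A new (key, value) pair only updates the totals; nothing is appended.
        self.S_kk += np.outer(k, k)
        self.S_kv += np.outer(k, v)

    def read(self, q):
        # Closed-form ridge solution W = (S_kk + lam*I)^{-1} S_kv, then v_hat = W^T q.
        d = self.S_kk.shape[0]
        W = np.linalg.solve(self.S_kk + self.lam * np.eye(d), self.S_kv)
        return W.T @ q

    def state_bytes(self):
        # Size of the state, independent of how many pairs were written.
        return self.S_kk.nbytes + self.S_kv.nbytes


def demo(n_pairs=32, d=64, seed=0):
    """Store random pairs, query each key, decode to the nearest stored value."""
    rng = np.random.default_rng(seed)
    keys = rng.standard_normal((n_pairs, d))
    vals = rng.standard_normal((n_pairs, d))
    mem = RunningRidgeMemory(d, d)
    for k, v in zip(keys, vals):
        mem.write(k, v)
    hits = 0
    for i, k in enumerate(keys):
        v_hat = mem.read(k)
        nearest = int(np.argmin(np.linalg.norm(vals - v_hat, axis=1)))
        hits += int(nearest == i)
    return hits / n_pairs, mem.state_bytes()


if __name__ == "__main__":
    accuracy, state_bytes = demo()
    print(f"recall accuracy: {accuracy:.2f}, state size: {state_bytes} bytes")
```

In this toy setting the number of stored pairs is kept below the key dimension, so the regression can fit every pair almost exactly. The state size is set by the key and value dimensions alone, which is the property the episode contrasts with a KV cache that grows with every token.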