A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking
Source: State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
Paper was published on April 30, 2026
This episode was AI-generated on May 9, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
What if a transformer didn't throw away its internal state every time it produced a word? A new paper adds a tiny per-layer memory to a frozen Gemma model — about 330,000 new parameters — and gets a 15-point reasoning gain on PhD-level science questions, trained in six hours on a single GPU. Along the way, it surfaces a measurable structure for what 'thinking in latent space' actually looks like.
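As a rough illustration of the per-layer memory idea, the sketch below blends a small fraction of a layer's previous output into its next input. It then compares the exact token-by-token recurrence with a two-pass version in which each pass can run over all positions at once. Everything here is an assumption made for illustration: the scalar `layer` function, the zero initial state, the blend rule, and the two-pass scheme are hypothetical stand-ins, not the paper's architecture or training code. Only the blend strength of roughly 3% comes from the episode notes.

```python
import math
import random

ALPHA = 0.03  # blend strength; "roughly 3%" per the episode notes


def layer(x):
    """Hypothetical stand-in for one frozen layer (a scalar toy, not Gemma)."""
    return math.tanh(1.5 * x)


def run_sequential(inputs, alpha=ALPHA):
    """Exact recurrence: each position blends in the previous position's output."""
    outputs = []
    state = 0.0  # assumed initial memory: empty
    for x in inputs:
        state = layer((1 - alpha) * x + alpha * state)
        outputs.append(state)
    return outputs


def run_two_pass(inputs, alpha=ALPHA):
    """Assumed two-pass approximation; each pass is independent across positions."""
    # Pass 1: ignore the memory entirely to get a draft output per position.
    draft = [layer(x) for x in inputs]
    # Pass 2: blend in the draft output of the previous position.
    previous = [0.0] + draft[:-1]
    return [layer((1 - alpha) * x + alpha * p) for x, p in zip(inputs, previous)]


def max_gap(inputs, alpha):
    """Largest per-position difference between the exact and two-pass outputs."""
    exact = run_sequential(inputs, alpha)
    approx = run_two_pass(inputs, alpha)
    return max(abs(a - b) for a, b in zip(exact, approx))


rng = random.Random(0)
tokens = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # toy "token" inputs

gap_full = max_gap(tokens, ALPHA)
gap_half = max_gap(tokens, ALPHA / 2)
print("gap at alpha:    ", gap_full)
print("gap at alpha / 2:", gap_half)
```

In this toy setting the draft is already within order α of the exact state, and that error is multiplied by α again when it is blended in, so the gap stays within a constant times α squared. That is one way to read an "order-α-squared" approximation, under the assumptions above.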
Key Takeaways
How a small per-layer 'sticky note' lets a transformer carry working state between tokens without giving up parallel training
The two-pass training trick that tames a nonlinear cross-position recurrence by accepting an order-α-squared approximation
Why the controlled, matched-baseline 15-point gain on GPQA-Diamond is the real result, and why the 'beats DeepSeek V3' framing deserves caveats
Basin shifts: evidence that latent reasoning happens in two distinct regimes, with long stable stretches punctuated by sudden reorganizations
The position-zero finding: a probe can read the very first hidden state and predict whether deeper iteration will help or hurt the answer
Why uniform iteration depth makes the model worse, and how a halting probe turns iteration into a per-question deliberation budget
00:00 - The puzzle: transformers that forget every token
Why standard transformers rebuild their working state from scratch at every token, and what biology suggests they might be missing.
02:54 - The mechanism: a per-layer sticky note
How the State Stream Transformer blends roughly 3% of each layer's previous output back in, unifying cross-token persistence and per-token iteration depth.
05:48 - Training a nonlinear recurrence in parallel
Why existing parallelization tricks fail for this recurrence, and the two-pass scheme that trades a small approximation for tractable training.
08:43 - Headline results and matched-baseline gains
A 15-point gap on GPQA-Diamond from a frozen Gemma 3 27B fine-tuned on grade-school math, with 330K new parameters and six hours on one GPU.
11:37 - Basin shifts: what latent reasoning looks like
Evidence that hidden states are mostly stable across iterations but occasionally undergo dramatic, content-dependent reorganizations that drive output changes.
14:31 - Position zero and the halting probe
Every GPQA-Diamond question shows a basin shift at the first generated token, and a small probe can read that state to predict whether more iteration will help or overthink.
17:26 - Steelmanning the limitations
Cross-paper comparisons, single backbone, bounded-not-measured approximation quality, and a proof-of-concept-scale halting probe.
20:20 - Why this matters: a third axis for reasoning compute
Latent compute as an alternative to scaling parameters or chain-of-thought, and what the basin-shift framework gives us as a measurable handle on hidden-state reasoning.
Recommended Reading
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
The state-space model the episode contrasts with SST. It shares the horizontal-axis goal of persistent state, but achieves it through a linear recurrence amenable to the parallel scan that SST has to work harder to approximate.
Universal Transformers
The canonical depth-axis paper the episode references, which explores iterating the transformer stack at a single position: the vertical dimension that SST combines with horizontal persistence.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recent recurrent-depth model that, like SST, argues for reasoning in latent space rather than via chain-of-thought tokens. A useful counterpoint to the episode's framing of latent compute as a third scaling axis.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
The benchmark whose Diamond subset drives the episode's headline 15-point gap and the position-zero halting probe analysis.