May 12, 2026

A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking

23 minutes

Source: State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

Paper was published on April 30, 2026

This episode was AI-generated on May 9, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

What if a transformer didn't throw away its internal state every time it produced a word? A new paper adds a tiny per-layer memory to a frozen Gemma model — about 330,000 new parameters — and gets a 15-point reasoning gain on PhD-level science questions, trained in six hours on a single GPU. Along the way, it surfaces a measurable structure for what 'thinking in latent space' actually looks like.

Key Takeaways

How a small per-layer 'sticky note' lets a transformer carry working state between tokens without giving up parallel training

The two-pass training trick that tames a nonlinear cross-position recurrence by accepting an order-α-squared approximation

Why the controlled, matched-baseline 15-point gain on GPQA-Diamond is the real result — and why the 'beats DeepSeek V3' framing deserves caveats

Basin shifts: evidence that latent reasoning happens in two distinct regimes — long stable stretches punctuated by sudden reorganizations

The position-zero finding: a probe can read the very first hidden state and predict whether deeper iteration will help or hurt the answer

Why uniform iteration depth makes the model worse, and how a halting probe turns iteration into a per-question deliberation budget

00:00 — The puzzle: transformers that forget every token
Why standard transformers rebuild their working state from scratch at every token, and what biology suggests they might be missing.

02:54 — The mechanism: a per-layer sticky note
How the State Stream Transformer blends roughly 3% of each layer's previous output back in, unifying cross-token persistence and per-token iteration depth.

05:48 — Training a nonlinear recurrence in parallel
Why existing parallelization tricks fail for this recurrence, and the two-pass scheme that trades a small approximation for tractable training.

08:43 — Headline results and matched-baseline gains
A 15-point gain on GPQA-Diamond over a matched baseline, from a frozen Gemma 3 27B whose 330K new parameters were trained on grade-school math in six hours on one GPU.

11:37 — Basin shifts: what latent reasoning looks like
Evidence that hidden states are mostly stable across iterations but occasionally undergo dramatic, content-dependent reorganizations that drive output changes.

14:31 — Position zero and the halting probe
Every GPQA-Diamond question shows a basin shift at the first generated token, and a small probe can read that state to predict whether more iteration will help or overthink.

17:26 — Steelmanning the limitations
Cross-paper comparisons, a single backbone, approximation quality that is bounded but not measured, and a halting probe that is only proof-of-concept scale.

20:20 — Why this matters: a third axis for reasoning compute
Latent compute as an alternative to scaling parameters or chain-of-thought, and what the basin-shift framework gives us as a measurable handle on hidden-state reasoning.
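
A toy illustration of the sticky-note blend and the two-pass idea from the chapter notes above. This is a minimal sketch under stated assumptions, not the paper's implementation: the convex blend `(1 - alpha) * x + alpha * state`, the `tanh` stand-in layer, and every function name here are hypothetical. It compares an exact position-by-position recurrence against a parallel two-pass approximation, where pass 1 ignores the state and pass 2 blends in the pass-1 outputs of the previous position.

```python
import numpy as np

ALPHA = 0.03  # "roughly 3%" blend from the episode notes; the exact blend form is assumed


def layer(h, W):
    """Stand-in nonlinear layer (hypothetical; a real layer is a full transformer block)."""
    return np.tanh(h @ W)


def sequential(x, W, alpha=ALPHA):
    """Exact recurrence: each position blends in the previous position's output."""
    state = np.zeros(x.shape[1])
    outputs = []
    for x_t in x:
        state = layer((1 - alpha) * x_t + alpha * state, W)
        outputs.append(state)
    return np.stack(outputs)


def shift_right(s):
    """Row t receives row t-1; position 0 sees an empty (zero) note."""
    return np.vstack([np.zeros((1, s.shape[1])), s[:-1]])


def two_pass(x, W, alpha=ALPHA):
    """Parallel approximation: pass 1 ignores the state, pass 2 blends in pass-1 outputs."""
    first = layer((1 - alpha) * x, W)
    second = layer((1 - alpha) * x + alpha * shift_right(first), W)
    return first, second


def max_errors(alpha, seed=0, seq_len=16, dim=8):
    """Largest absolute gap to the exact recurrence after one pass and after two."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(seq_len, dim))
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    exact = sequential(x, W, alpha)
    first, second = two_pass(x, W, alpha)
    return np.abs(first - exact).max(), np.abs(second - exact).max()


if __name__ == "__main__":
    for a in (0.03, 0.06):
        one, two = max_errors(a)
        print(f"alpha={a}: one-pass error {one:.2e}, two-pass error {two:.2e}")
```

In this toy setup, doubling `alpha` should roughly double the one-pass error but roughly quadruple the two-pass error, which is what an order-α-squared approximation predicts. Whether the paper's scheme takes exactly this form is an assumption of the sketch.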