May 15, 2026

When the Iteration Teaches the Model to Skip the Iteration

29 minutes

Source: Solve the Loop: Attractor Models for Language and Reasoning

Paper was published on May 12, 2026

This episode was AI-generated on May 13, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Three frontier language models score zero on hard Sudoku. A 27-million-parameter model solves 91%. But the real surprise in this paper isn't the benchmark: the iterative refinement procedure the model is trained with quietly disappears at inference, absorbed into a single forward pass. We work through how that happens, and what it might mean.

Key Takeaways

Why looped Transformers have been fragile to train, and how reframing the loop as a fixed-point problem gives you constant training memory regardless of iteration depth

The implicit differentiation trick that makes this work — and the cheaper one-step approximation that holds up in language modeling but breaks in the reasoning regime

How Attractor Models set a new Pareto frontier on language modeling, with a 770M model matching a 1.3B Transformer trained on twice the tokens

Why TRM collapses from 75% to 0% on Sudoku when you scale it from 7M to 27M parameters — and why the Attractor Model at the same scale doesn't

Equilibrium internalization: the trained backbone learns to put its first guess at the fixed point, making the refinement module obsolete at inference — an emergent self-distillation nobody designed in

The 'implicit gradient barrier' argument for why this training is structurally more stable than fixed-depth looped training, and where that argument is intuition rather than proof
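
The fixed-point reframe and the two gradient options in the takeaways above can be sketched on a toy problem. This is a minimal illustration, not the paper's code: the map f(z, x) = tanh(Wz + x), the squared-norm loss, and every function name here are assumptions chosen so the example stays small. The exact implicit gradient solves one linear system at the equilibrium z* and never stores the solver's trajectory, which is why memory does not grow with iteration depth. The one-step variant skips that linear solve and backpropagates through a single application of f.

```python
import numpy as np

def f(z, x, W):
    """One loop step: a toy stand-in for a refinement module (assumption)."""
    return np.tanh(W @ z + x)

def solve_fixed_point(x, W, tol=1e-12, max_iter=1000):
    """Iterate z <- f(z, x) until it stops moving; only the last z is kept."""
    z = np.zeros_like(x)
    for step in range(1, max_iter + 1):
        z_next = f(z, x, W)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

def loss(z):
    """Toy loss on the equilibrium state (assumption): 0.5 * ||z||^2."""
    return 0.5 * float(z @ z)

def _tanh_slope(z_star, x, W):
    # Derivative of tanh evaluated at the pre-activation W z* + x.
    return 1.0 - np.tanh(W @ z_star + x) ** 2

def implicit_grad_x(z_star, x, W):
    """Exact dL/dx from differentiating z* = f(z*, x) at the equilibrium."""
    d = _tanh_slope(z_star, x, W)
    J_z = d[:, None] * W                # df/dz at z*, i.e. diag(d) @ W
    g = z_star                          # dL/dz* for the toy loss
    u = np.linalg.solve(np.eye(x.size) - J_z.T, g)
    return d * u                        # (df/dx)^T u, with df/dx = diag(d)

def one_step_grad_x(z_star, x, W):
    """Cheaper approximation: drop the linear solve, backprop one step of f."""
    d = _tanh_slope(z_star, x, W)
    return d * z_star

def make_contractive_W(n, scale=0.5, seed=0):
    """Random matrix rescaled to spectral norm `scale` so f is a contraction."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    return scale * W / np.linalg.norm(W, 2)

W = make_contractive_W(4)
x = np.array([0.3, -0.2, 0.5, 0.1])
z_star, steps = solve_fixed_point(x, W)
print("solver steps:", steps)
print("implicit gradient:", implicit_grad_x(z_star, x, W))
print("one-step gradient:", one_step_grad_x(z_star, x, W))
```

Neither gradient function takes the intermediate iterates as input, so the solver can run for as many steps as it needs without adding to what the backward pass must hold. The two gradients generally differ; how much that gap matters in practice is the question the episode takes up for language modeling versus reasoning.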

00:00 — The problem with one-forward-pass reasoning
Why Transformers can't think harder about harder tokens, and why prior fixes — chain-of-thought and looped Transformers — each have serious costs.

03:18 — Don't unroll the loop, solve for where it ends
The empirical observation that trained looped models are doing fixed-point iteration, and the reframe that follows from taking that seriously.

06:37 — Implicit differentiation and constant-memory training
How differentiating the equilibrium condition itself — rather than the trajectory — decouples training memory from iteration depth, and the cheap approximation the paper actually uses.

09:56 — Two design choices that make it work
Putting the equilibrium in output-embedding space and initializing the solver from a full Transformer's draft, rather than from zero in an abstract hidden state.

13:15 — Language modeling results
A new Pareto frontier on perplexity versus compute, with the 770M model matching a 1.3B Transformer trained on twice the tokens.

16:34 — The Sudoku result and what it actually means
Frontier LLMs at zero, TRM collapsing at 27M parameters, and the Attractor Model at 91% — plus a careful read of which comparison is the fair fight.

27:11 — Equilibrium internalization
The most striking finding in the paper: the refinement procedure trains the backbone to produce the converged answer in one pass, making the iteration unnecessary at inference.

22:06 — The implicit gradient barrier
A theoretical argument for why this training stays in the stable, contractive regime — and where the argument is intuition rather than guarantee.

26:31 — Where the paper reaches and what to watch
Two distinct training recipes hiding under one architecture diagram, fast-moving baselines, and the bigger idea: baking expensive teachers into training so models internalize them for free.
