May 06, 2026

Language Models Compute the Rational Move, Then Override It

29 minutes

Source: What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

The paper was published on April 29, 2026.

This episode was AI-generated on May 3, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Two language models playing Prisoner's Dilemma both internally compute that they should defect — and then cooperate anyway, every single time. A new paper finds the override circuit, and shows the entire strategic behavior of the model collapses to a single dial you can turn at inference time.
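For listeners who want to see why defecting is the "rational move" in a one-shot Prisoner's Dilemma, here is a minimal sketch that checks every action pair for a Nash equilibrium. The payoff numbers are a common textbook choice and an assumption here, not values taken from the paper; the names `PAYOFFS`, `is_nash`, and `nash_equilibria` are illustrative.

```python
from itertools import product

# Illustrative payoffs (assumed, not from the paper), as (player A, player B).
# "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def is_nash(a, b):
    """True if neither player can gain by unilaterally switching actions."""
    u_a, u_b = PAYOFFS[(a, b)]
    a_is_best = all(PAYOFFS[(alt, b)][0] <= u_a for alt in ACTIONS)
    b_is_best = all(PAYOFFS[(a, alt)][1] <= u_b for alt in ACTIONS)
    return a_is_best and b_is_best


def nash_equilibria():
    """Return every pure-strategy Nash equilibrium of the game."""
    return [(a, b) for a, b in product(ACTIONS, ACTIONS) if is_nash(a, b)]


print(nash_equilibria())  # [('D', 'D')]
```

Under these assumed payoffs, mutual defection is the only pair where neither player benefits from changing their move, even though mutual cooperation pays both players more.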

Key Takeaways

Why every tested model cooperates 100% of the time in direct-mode Prisoner's Dilemma — a 'universal cooperative lock' that holds across architectures and scales

How the logit lens reveals that Llama-3-8B votes 'defect' through 23 layers, then flips to 84% cooperation by layer 30

Why ablating the most plausible attention heads does nothing — and what it means that the override is a 'choir, not a soloist'

How a single steering vector added at the first three layers dials cooperation from 0.1% to 98.6%, with no retraining

Why one small model in a multi-agent group can unravel cooperation for everyone — a failure mode invisible in self-play evaluation

Where the paper's reach exceeds its grasp: all mechanistic work is on one 8B model, and the RLHF attribution is asserted but never tested
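The "single dial" idea in the steering-vector takeaway above can be pictured with a toy sketch: add a scaled direction to a hidden state and watch a two-way choice shift. Everything here is hypothetical, including the vectors, the `readout`, the function names, and the scale values; it does not reproduce the paper's vector extraction, its layer choice, or its reported percentages.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden-state vector."""
    norm = math.sqrt(dot(direction, direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


def p_cooperate(hidden, readout):
    """Toy two-way choice: sigmoid of the state's projection onto a readout."""
    return 1.0 / (1.0 + math.exp(-dot(hidden, readout)))


# All numbers below are made up for illustration.
hidden = [0.2, -0.5, 0.1]      # hypothetical residual-stream state
direction = [1.0, 2.0, -1.0]   # hypothetical "cooperative" direction
readout = [0.5, 1.0, -0.5]     # hypothetical cooperate-vs-defect readout

for alpha in (-4.0, 0.0, 4.0):
    steered = steer(hidden, direction, alpha)
    print(alpha, round(p_cooperate(steered, readout), 3))
```

Because the assumed readout points the same way as the assumed direction, raising `alpha` pushes the toy cooperation probability up and lowering it pushes the probability down, with no change to any weights.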

00:00 — The compute-then-suppress thesis
Why the paper's framing reverses the standard story that LLMs simply lack strategic competence.

03:55 — The universal cooperative lock
Behavioral results showing every model, at every scale, locks into 100% cooperation in direct-mode Prisoner's Dilemma.

07:18 — Inside the forward pass with probes and the logit lens
How layer-by-layer analysis reveals the model holding the Nash answer for most of the network before a late-layer flip to cooperation.

10:57 — Why ablation fails and what that reveals
The negative result that rules out localized circuits and points toward a distributed direction in the residual stream.

14:36 — Finding the dial: steering and concept clamping
How three independent methods recover the same cooperative direction, and how clamping closes the causal loop from 0.1% to 98.6% cooperation.

18:16 — Cross-play and the contaminator effect
Heterogeneous model pairings expose failure modes — including one small model dragging larger ones into mutual defection — that self-play evaluation hides.

21:55 — Steelmanning the limitations
Where the paper overreaches: single-model mechanistic evidence, tiny games, untested RLHF attribution, and logit lens caveats.

25:34 — What this changes about studying LLMs
Why the 'compute then override' frame may generalize to honesty, refusal, and sycophancy — and what that means for inference-time control.