Language Models Compute the Rational Move, Then Override It
Source: What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Paper was published on April 29, 2026
This episode was AI-generated on May 3, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
Two language models playing Prisoner's Dilemma both internally compute that they should defect — and then cooperate anyway, every single time. A new paper finds the override circuit, and shows the entire strategic behavior of the model collapses to a single dial you can turn at inference time.
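To make the "rational move" concrete, here is a minimal Python sketch that enumerates the pure-strategy Nash equilibria of a one-shot Prisoner's Dilemma. The payoff numbers (5, 3, 1, 0) are a textbook-style assumption, not values taken from the paper, and the names `PAYOFFS` and `is_nash` are hypothetical.

```python
from itertools import product

C, D = "cooperate", "defect"
ACTIONS = (C, D)

# Assumed payoffs as (row player, column player).
# Illustrative numbers only; they are not taken from the paper.
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}


def is_nash(a_row, a_col):
    """True if neither player gains by unilaterally switching actions."""
    row_payoff, col_payoff = PAYOFFS[(a_row, a_col)]
    row_ok = all(PAYOFFS[(alt, a_col)][0] <= row_payoff for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a_row, alt)][1] <= col_payoff for alt in ACTIONS)
    return row_ok and col_ok


# Check all four action profiles and keep the stable ones.
nash_profiles = [p for p in product(ACTIONS, ACTIONS) if is_nash(*p)]
print(nash_profiles)
```

With these assumed payoffs, mutual defection is the only profile that survives the check, which is the sense in which defecting is the Nash play that the models reportedly compute and then override.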
Key Takeaways
Why every tested model cooperates 100% of the time in direct-mode Prisoner's Dilemma: a 'universal cooperative lock' that holds across architectures and scales
How the logit lens reveals that Llama-3-8B votes 'defect' through 23 layers, then flips to 84% cooperation by layer 30
Why ablating the most plausible attention heads does nothing, and what it means that the override is a 'choir, not a soloist'
How a single steering vector added at the first three layers dials cooperation from 0.1% to 98.6%, with no retraining
Why one small model in a multi-agent group can unravel cooperation for everyone, a failure mode invisible in self-play evaluation
Where the paper's reach exceeds its grasp: all mechanistic work is on one 8B model, and the RLHF attribution is asserted but never tested

00:00 - The compute-then-suppress thesis
Why the paper's framing reverses the standard story that LLMs simply lack strategic competence.

03:55 - The universal cooperative lock
Behavioral results showing every model, at every scale, locks into 100% cooperation in direct-mode Prisoner's Dilemma.

07:18 - Inside the forward pass with probes and the logit lens
How layer-by-layer analysis reveals the model holding the Nash answer for most of the network before a late-layer flip to cooperation.

10:57 - Why ablation fails and what that reveals
The negative result that rules out localized circuits and points toward a distributed direction in the residual stream.

14:36 - Finding the dial: steering and concept clamping
How three independent methods recover the same cooperative direction, and how clamping closes the causal loop from 0.1% to 98.6% cooperation.

18:16 - Cross-play and the contaminator effect
Heterogeneous model pairings expose failure modes that self-play evaluation hides, including one small model dragging larger ones into mutual defection.

21:55 - Steelmanning the limitations
Where the paper overreaches: single-model mechanistic evidence, tiny games, untested RLHF attribution, and logit lens caveats.

25:34 - What this changes about studying LLMs
Why the 'compute then override' frame may generalize to honesty, refusal, and sycophancy, and what that means for inference-time control.

Recommended Reading
Activation Addition: Steering Language Models Without Optimization - The foundational paper on activation steering via residual stream vectors, the technique this episode's paper applies to suppress or amplify cooperative behavior.
Refusal in Language Models Is Mediated by a Single Direction - A close methodological cousin showing that refusal, another RLHF-installed behavior, also lives as a low-dimensional residual stream direction, exactly the generalization the episode speculates about.
Eliciting Latent Predictions from Transformers with the Tuned Lens - Introduces the tuned lens, the variant that Juniper flags as addressing limitations of the original logit lens used to identify the layer-24 cooperative flip.
Playing Repeated Games with Large Language Models - An earlier behavioral study of LLMs in canonical 2x2 games that established the cooperative-bias observations for which this paper now provides a mechanistic explanation.
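
For readers who want a feel for what "a single steering vector" means, the following is a toy Python sketch of activation steering on synthetic vectors. It assumes a difference-of-means direction, a common choice in the steering literature that the paper may or may not use; the activations are random stand-ins rather than real Llama-3-8B hidden states, and all function names are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(coop_acts, defect_acts):
    """Unit vector from the mean 'defect' activation to the mean 'cooperate' one."""
    diff = [c - d for c, d in zip(mean_vector(coop_acts), mean_vector(defect_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Add alpha * direction to a single residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Dot product of a hidden vector with the steering direction."""
    return sum(h * d for h, d in zip(hidden, direction))


random.seed(0)
dim = 8
# Synthetic activations (an assumption): the two classes differ along axis 0.
coop_acts = [[random.gauss(0, 1) + (2.0 if i == 0 else 0.0) for i in range(dim)]
             for _ in range(50)]
defect_acts = [[random.gauss(0, 1) - (2.0 if i == 0 else 0.0) for i in range(dim)]
               for _ in range(50)]

direction = steering_direction(coop_acts, defect_acts)
hidden = [random.gauss(0, 1) for _ in range(dim)]

before = projection(hidden, direction)
after = projection(steer(hidden, direction, alpha=4.0), direction)
shift = after - before
print(round(shift, 6))
```

Because the direction has unit length, adding `alpha` times it moves the projection onto that direction by exactly `alpha` (up to floating-point error). In a real model the same addition would be applied to the residual stream at the chosen layers; that hook-up is omitted here.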