AI Papers: A Deep Dive

Language Models Compute the Rational Move, Then Override It



Source: What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

The paper was published on April 29, 2026.

This episode was AI-generated on May 3, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Two language models playing Prisoner's Dilemma both internally compute that they should defect — and then cooperate anyway, every single time. A new paper finds the override circuit, and shows the entire strategic behavior of the model collapses to a single dial you can turn at inference time.
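For readers who want the game-theory baseline spelled out, here is a minimal sketch of why "defect" is the rational move in a one-shot Prisoner's Dilemma. The payoff numbers are the conventional textbook values and are an assumption for illustration; the paper's own payoff matrix is not given in these notes. The helper names `best_responses` and `pure_nash_equilibria` are hypothetical.

```python
from itertools import product

C, D = "cooperate", "defect"

# Assumed textbook payoffs (not taken from the paper):
# key = (row move, column move), value = (row payoff, column payoff).
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}


def best_responses(opponent_move, player):
    """Moves that maximize `player`'s payoff against a fixed opponent move.

    player is 0 for the row player and 1 for the column player.
    """
    def payoff(move):
        profile = (move, opponent_move) if player == 0 else (opponent_move, move)
        return PAYOFFS[profile][player]

    best = max(payoff(m) for m in (C, D))
    return {m for m in (C, D) if payoff(m) == best}


def pure_nash_equilibria():
    """Profiles where each player's move is a best response to the other's."""
    return [
        (a, b)
        for a, b in product((C, D), repeat=2)
        if a in best_responses(b, 0) and b in best_responses(a, 1)
    ]


print(pure_nash_equilibria())
```

With these payoffs, defecting is the best response to either opponent move, so mutual defection is the only pure-strategy Nash equilibrium, even though mutual cooperation pays both players more. That gap is what makes 100% cooperation a departure from Nash play.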

Key Takeaways
  • Why every tested model cooperates 100% of the time in direct-mode Prisoner's Dilemma — a 'universal cooperative lock' that holds across architectures and scales
  • How the logit lens reveals that Llama-3-8B votes 'defect' through 23 layers, then flips to 84% cooperation by layer 30
  • Why ablating the most plausible attention heads does nothing — and what it means that the override is a 'choir, not a soloist'
  • How a single steering vector added at the first three layers dials cooperation from 0.1% to 98.6%, with no retraining
  • Why one small model in a multi-agent group can unravel cooperation for everyone — a failure mode invisible in self-play evaluation
  • Where the paper's reach exceeds its grasp: all mechanistic work is on one 8B model, and the RLHF attribution is asserted but never tested
    • 00:00 — The compute-then-suppress thesis
      Why the paper's framing reverses the standard story that LLMs simply lack strategic competence.
    • 03:55 — The universal cooperative lock
      Behavioral results showing every model, at every scale, locks into 100% cooperation in direct-mode Prisoner's Dilemma.
    • 07:18 — Inside the forward pass with probes and the logit lens
      How layer-by-layer analysis reveals the model holding the Nash answer for most of the network before a late-layer flip to cooperation.
    • 10:57 — Why ablation fails and what that reveals
      The negative result that rules out localized circuits and points toward a distributed direction in the residual stream.
    • 14:36 — Finding the dial: steering and concept clamping
      How three independent methods recover the same cooperative direction, and how clamping closes the causal loop from 0.1% to 98.6% cooperation.
    • 18:16 — Cross-play and the contaminator effect
      Heterogeneous model pairings expose failure modes — including one small model dragging larger ones into mutual defection — that self-play evaluation hides.
    • 21:55 — Steelmanning the limitations
      Where the paper overreaches: single-model mechanistic evidence, tiny games, untested RLHF attribution, and logit lens caveats.
    • 25:34 — What this changes about studying LLMs
      Why the 'compute then override' frame may generalize to honesty, refusal, and sycophancy — and what that means for inference-time control.
    • Recommended Reading
      • Activation Addition: Steering Language Models Without Optimization — The foundational paper on activation steering via residual stream vectors — the technique this episode's paper applies to suppress or amplify cooperative behavior.
      • Refusal in Language Models Is Mediated by a Single Direction — A close methodological cousin showing that refusal — another RLHF-installed behavior — also lives as a low-dimensional residual stream direction, exactly the generalization the episode speculates about.
      • Eliciting Latent Predictions from Transformers with the Tuned Lens — Introduces the tuned lens variant Juniper flags as addressing limitations of the original logit lens used to identify the layer-24 cooperative flip.
      • Playing Repeated Games with Large Language Models — An earlier behavioral study of LLMs in canonical 2x2 games that established the cooperative-bias observations this paper now provides a mechanistic explanation for.
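The "single dial" idea from the takeaways can be illustrated with a toy sketch of activation steering. Everything here is an assumption for illustration: the three-dimensional activations, the unembedding vectors, and the difference-of-means recipe for the direction are invented stand-ins, not the paper's data or its exact method. The sketch only shows the general shape of the technique: add a scaled direction to a hidden state and watch the probability of the "cooperate" token move.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(coop_acts, defect_acts):
    """Unit vector from the mean 'defect' activation to the mean 'cooperate' one."""
    diff = [a - b for a, b in zip(mean(coop_acts), mean(defect_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state (the 'dial')."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def p_cooperate(hidden, w_coop, w_defect):
    """Two-token softmax, written as a sigmoid of the logit gap."""
    gap = dot(hidden, w_coop) - dot(hidden, w_defect)
    return 1.0 / (1.0 + math.exp(-gap))


# Invented toy activations and unembedding rows (hypothetical values).
COOP_ACTS = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
DEFECT_ACTS = [[-2.0, 0.0, 0.1], [-1.8, 0.2, -0.1]]
W_COOP, W_DEFECT = [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]
HIDDEN = [0.0, 0.5, -0.3]  # a neutral starting state

DIRECTION = steering_direction(COOP_ACTS, DEFECT_ACTS)

for alpha in (-3.0, -1.0, 0.0, 1.0, 3.0):
    prob = p_cooperate(steer(HIDDEN, DIRECTION, alpha), W_COOP, W_DEFECT)
    print(f"alpha={alpha:+.1f}  P(cooperate)={prob:.3f}")
```

In this toy, negative values of `alpha` push the state toward defection and positive values toward cooperation, with no weights changed. A real experiment would add the vector to a model's residual stream at chosen layers during the forward pass.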

AI Papers: A Deep Dive, by paperdive.ai