How LLMs Get Persuaded: One Attention Head, A Tetrahedron, And A Single Dial
Source: How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Paper was published on May 10, 2026
This episode was AI-generated on May 12, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
A new paper traces the entire causal chain of how a persuasive passage flips a large language model's answer — and the machinery turns out to be astonishingly narrow. One attention head out of a thousand, a three-dimensional pyramid of choices, and a single scalar lever that decides which option wins.
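To make the pyramid picture concrete, here is a minimal sketch in Python. It is an illustration only: the vertex coordinates are a standard textbook construction of a regular tetrahedron, and the `before` and `after` states are invented numbers, not values from the paper. The helper names (`VERTICES`, `nearest_option`) are hypothetical.

```python
import itertools
import math

# Four unit vectors in 3D pointing at the corners of a regular tetrahedron.
# Textbook coordinates chosen for illustration, not taken from the paper.
_S = 1.0 / math.sqrt(3.0)
VERTICES = {
    "A": (_S, _S, _S),
    "B": (_S, -_S, -_S),
    "C": (-_S, _S, -_S),
    "D": (-_S, -_S, _S),
}


def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def nearest_option(state):
    """Return the option whose vertex has the largest dot product with `state`."""
    return max(VERTICES, key=lambda name: dot(VERTICES[name], state))


# Cosine similarity for each of the six vertex pairs (all vectors are unit length).
pairwise = [
    dot(VERTICES[a], VERTICES[b])
    for a, b in itertools.combinations(VERTICES, 2)
]

# Hypothetical head states: one near vertex A, one after a shift toward C.
before = (0.5, 0.6, 0.55)
after = (-0.5, 0.6, -0.55)

print(nearest_option(before), "->", nearest_option(after))  # prints: A -> C
```

Every entry of `pairwise` comes out at -1/3, which is what makes the four options equally separated. In this toy picture, a persuasive passage does not blur the state toward the center of the pyramid; it moves the state from one vertex's region into another's.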
Key Takeaways
- Persuasion in LLMs is mediated by a tiny number of mid-layer attention heads (often just one), verified by causal activation patching across four model families
- The decision head encodes four answer options as four vertices of a near-regular tetrahedron, and persuasion is a discrete jump between vertices, not a gradual drift in uncertainty
- The head isn't reasoning, it's copying; the 'where to look' circuit explains ~88% of the persuasion effect while the value-copy circuit is nearly perfect transcription
- All the high-dimensional routing logic collapses to a single scalar feature per option token, which the authors can turn up or down to steer the model's choice
- Upstream shallow heads in layers 8–12 do keyword recognition (like spotting 'Nigeria') and write the routing signal onto matching option tokens, completing the relay
- The mechanism partially transfers to a more realistic GEO benchmark, but the cleanest results (tetrahedron, rank-1 feature, discrete jump) are tied to a four-option choice geometry that may not generalize to free-form generation

00:00 — GEO and the question of mechanism
Why Generative Engine Optimization works in practice, and what it means to ask mechanically how an LLM gets persuaded into repeating a poisoned source.

02:34 — Locating the persuasion circuit
How activation patching identifies a single mid-layer attention head as causally responsible for persuasion, replicated across Llama, Qwen, and Gemma families.

05:09 — The tetrahedron and the discrete jump
The decision head's output collapses to a 3D subspace where four answer choices sit at vertices of a pyramid, and persuasion flips the state from one vertex to another.

07:44 — The head is copying, not reasoning
Separating the 'where to look' and 'what to copy' circuits reveals that persuasion is misdirected attention, not corrupted knowledge.

10:19 — The single dial behind the routing
A rank-1 approximation reduces the head's decision logic to one scalar per option, and the authors steer the model's answer by adding or subtracting that feature directly.

12:54 — Upstream keyword heads and the full relay
Shallow heads in layers 8–12 do the keyword recognition that writes the routing signal, completing a verified end-to-end chain from prompt token to final answer.

15:29 — Does this transfer to real attacks?
Results on Geo-Bench, a more realistic source-selection benchmark, show that the mechanism's architecture holds, though with weaker magnitudes and with some experiments not yet repeated.

18:04 — Steelman and limitations
Honest pushback on the multiple-choice setup, the rank-1 framing, and what 'partial transfer' really means for the strength of the claims.

20:39 — What the map enables
Why a known mechanism opens the door to runtime monitors and feature-level interventions, and what this adds to the case that LLM behaviors are sparse and legible.

Recommended Reading
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small — The canonical example of using activation patching to isolate a narrow attention-head circuit responsible for a specific behavior, and the methodological template the persuasion paper builds on.
- A Mathematical Framework for Transformer Circuits — Anthropic's foundational decomposition of attention heads into QK (where to look) and OV (what to copy) circuits, exactly the split the episode hinges on when separating routing from transcription.
- Locating and Editing Factual Associations in GPT (ROME) — A contrasting case study where causal tracing localizes factual knowledge to MLPs rather than attention heads, useful for thinking about when persuasion-style routing vs. knowledge storage dominates.
- Towards Understanding Sycophancy in Language Models — Anthropic's empirical study of a closely related failure mode (models being swayed by user framing rather than evidence), which the episode cites as part of the sparse-behavior pattern.
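
For listeners who want to play with the 'where to look' versus 'what to copy' split and the single-dial idea, the following toy sketch may help. It is not the paper's code or model. It assumes one scalar routing feature per option token, a softmax over those scalars as the attention pattern, and one-hot values that are copied unchanged. The names `head_output`, `choose`, and `steer`, the `sharpness` parameter, and all the numbers are hypothetical.

```python
import math

OPTIONS = ["A", "B", "C", "D"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_output(routing, sharpness=4.0):
    """Toy attention head over the four option tokens.

    Where to look: attention weights come from one scalar per option token.
    What to copy: each option token's value is a one-hot label, copied as-is.
    """
    weights = softmax([sharpness * r for r in routing])
    n = len(OPTIONS)
    values = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Weighted sum of the value vectors, one output entry per option.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(n)]


def choose(routing):
    """Pick the option with the largest entry in the head's output."""
    out = head_output(routing)
    return OPTIONS[out.index(max(out))]


def steer(routing, option, delta):
    """Turn the dial: add `delta` to one option's routing scalar."""
    idx = OPTIONS.index(option)
    return [r + delta if i == idx else r for i, r in enumerate(routing)]


# Invented routing scalars: the unpersuaded model favors A.
baseline = [1.0, 0.2, 0.1, 0.0]
# A "persuasive" nudge raises only option C's scalar.
persuaded = steer(baseline, "C", 2.0)

print(choose(baseline), "->", choose(persuaded))  # prints: A -> C
```

In this sketch the copied values never change. Only the attention pattern moves, which is the sense in which persuasion looks like misdirected attention rather than corrupted knowledge. Subtracting the same `delta` again restores the original choice.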