Why LLM Judges Flip Their Verdicts When You Change the Question Format
Paper was published on May 15, 2026
This episode was AI-generated on May 19, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
Ask a language model to rate text from 1 to 5 and it says four. Ask it yes-or-no on the same text and it says no. A new paper opens the hood and finds the judgment itself is stable — what wobbles is a tiny piece of machinery near the output that translates an abstract verdict into whichever token the prompt demanded. If they're right, a lot of what we call evaluator unreliability is actually a formatting artifact.
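As a toy illustration of that claim, and not the paper's actual method, the sketch below feeds one fixed latent judgment score through two different output formatters. The function names, the 0-to-1 score scale, and every threshold are assumptions invented for this example. It only shows how a stable internal verdict could still surface as a "4" under one format and a "no" under another.

```python
# Toy model: one stable judgment, two format-specific "translators".
# All names and thresholds here are hypothetical, chosen for illustration.

def rating_formatter(score: float) -> str:
    """Map a latent score in [0, 1] to a 1-5 rating token using even bins."""
    bin_edges = [0.2, 0.4, 0.6, 0.8]
    return str(1 + sum(score >= edge for edge in bin_edges))

def yes_no_formatter(score: float, threshold: float = 0.75) -> str:
    """Map the same latent score to a yes/no token with its own cutoff."""
    return "yes" if score >= threshold else "no"

# The underlying judgment does not change between the two prompts.
latent_score = 0.7

rating = rating_formatter(latent_score)    # three bin edges passed, so "4"
verdict = yes_no_formatter(latent_score)   # below the 0.75 cutoff, so "no"

print(f"rating format: {rating}, yes/no format: {verdict}")
```

In this toy, the disagreement comes entirely from the two formatters drawing their lines in different places, which is the kind of formatting artifact the episode describes.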
Key Takeaways
Why LLM judges produce inconsistent scores across prompt formats, and why the inconsistency lives in output routing, not evaluation quality
The 'transplant' experiment: copying activations from a rating prompt into a yes-no prompt flips the model's answer over 99% of the time on some models
Evidence that judgment is encoded along a single direction in activation space, a 'compass needle' that transfers across grammar, entailment, similarity, and preference tasks
A practical alternative to prompted scoring: read the judgment axis directly from mid-layer activations and bypass the noisy formatter
Where the clean modularity story breaks down: Gemma-3 at 12B entangles judgment with world knowledge in a way no other tested model does
Honest limits of the result: small per-cell sample sizes, probe design that partly presupposes a 1D encoding, and a universality claim the authors deliberately don't make
00:00 - The format-inconsistency puzzle
Why the same model gives different verdicts under different output formats, and why prior work stopped at behavioral observation.
03:12 - Latent Evaluator and Task Formatters
The paper's central decomposition: a shared judgment core in the middle layers, and format-specific translators at the output.
04:07 - Format Transfer Injection
Transplanting activations from a rating run into a yes-no run flips the output: judgment is portable, formatting is local.
09:36 - Judgment as a single direction
Evidence that the model's verdict reduces to a scalar along one axis in activation space, transferable across tasks and steerable.
12:48 - How they found the circuit
Position-aware Edge Attribution Patching, contrastive circuit tracing, and independent SAE corroboration of the shared core.
16:00 - Where the modularity story breaks
Gemma-3 at 12B entangles judgment with world knowledge while its 27B sibling doesn't; modularity is architecture- and scale-dependent.
19:12 - Pressure points and limitations
Small sample sizes behind the headline numbers, probe-design assumptions, and the absence of head-to-head comparison with an independent attribution method.
22:24 - What this means for LLM-as-a-judge
Reading the internal judgment axis directly can beat the model's prompted output, and benchmark format differences may measure formatter geometry rather than evaluation quality.
Recommended Reading
Towards Automated Circuit Discovery for Mechanistic Interpretability - Introduces the circuit-discovery methodology that PEAP builds on, giving listeners the foundation for how researchers identify causally important sub-networks like the Latent Evaluator.
Attribution Patching: Activation Patching At Industrial Scale - The gradient-based attribution method that PEAP extends to edges; useful for understanding the one-forward-one-backward shortcut the episode describes for scoring millions of connections.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - The canonical reference for the LLM-as-a-judge paradigm whose format-inconsistency problems this episode's paper goes inside the model to explain.
Do Llamas Work in English? On the Latent Language of Multilingual Transformers - A parallel 'compute abstractly, then translate' finding for multilingual models, supporting the episode's broader claim that judgment-vs-formatting is part of a recurring pattern in interpretability.