AI Papers: A Deep Dive

Why LLM Judges Flip Their Verdicts When You Change the Question Format



Source: Judge Circuits

Paper was published on May 15, 2026

This episode was AI-generated on May 19, 2026. The script was written by an AI language model, and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Ask a language model to rate text from 1 to 5 and it says four. Ask it yes-or-no on the same text and it says no. A new paper opens the hood and finds the judgment itself is stable — what wobbles is a tiny piece of machinery near the output that translates an abstract verdict into whichever token the prompt demanded. If they're right, a lot of what we call evaluator unreliability is actually a formatting artifact.
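
To make that claim concrete, here is a toy sketch in Python. It is not the paper's method: the latent score, the two formatter functions (`format_rating`, `format_yes_no`), and their thresholds are all invented for illustration. It only shows how one stable internal verdict could surface as a four under a rating prompt and as a "no" under a yes-or-no prompt.

```python
# Toy illustration only: the latent score, thresholds, and function names
# are assumptions made for this sketch, not values taken from the paper.

def format_rating(latent: float) -> int:
    """Map a latent verdict in [0, 1] to a 1-5 rating using equal-width bins."""
    clipped = min(max(latent, 0.0), 1.0)
    return min(int(clipped * 5) + 1, 5)

def format_yes_no(latent: float, threshold: float = 0.8) -> str:
    """Map the same latent verdict to yes/no with a hypothetical strict cutoff."""
    return "yes" if latent >= threshold else "no"

latent_verdict = 0.7  # one stable internal judgment about one piece of text

rating = format_rating(latent_verdict)
answer = format_yes_no(latent_verdict)
print(rating, answer)
```

In this sketch the disagreement comes entirely from where each formatter places its cutoffs; the underlying verdict never changes.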

Key Takeaways
  • Why LLM judges produce inconsistent scores across prompt formats — and why the inconsistency lives in output routing, not evaluation quality
  • The 'transplant' experiment: copying activations from a rating prompt into a yes-no prompt flips the model's answer over 99% of the time on some models
  • Evidence that judgment is encoded along a single direction in activation space — a 'compass needle' that transfers across grammar, entailment, similarity, and preference tasks
  • A practical alternative to prompted scoring: read the judgment axis directly from mid-layer activations and bypass the noisy formatter
  • Where the clean modularity story breaks down — Gemma-3 at 12B entangles judgment with world knowledge in a way no other tested model does
  • Honest limits of the result: small per-cell sample sizes, probe design that partly presupposes a 1D encoding, and a universality claim the authors deliberately don't make

  • 00:00 - The format-inconsistency puzzle
    Why the same model gives different verdicts under different output formats, and why prior work stopped at behavioral observation.
  • 03:12 - Latent Evaluator and Task Formatters
    The paper's central decomposition: a shared judgment core in the middle layers, and format-specific translators at the output.
  • 04:07 - Format Transfer Injection
    Transplanting activations from a rating run into a yes-no run flips the output: judgment is portable, formatting is local.
  • 09:36 - Judgment as a single direction
    Evidence that the model's verdict reduces to a scalar along one axis in activation space, transferable across tasks and steerable.
  • 12:48 - How they found the circuit
    Position-aware Edge Attribution Patching, contrastive circuit tracing, and independent SAE corroboration of the shared core.
  • 16:00 - Where the modularity story breaks
    Gemma-3 at 12B entangles judgment with world knowledge while its 27B sibling doesn't; modularity is architecture- and scale-dependent.
  • 19:12 - Pressure points and limitations
    Small sample sizes behind the headline numbers, probe-design assumptions, and the absence of head-to-head comparison with an independent attribution method.
  • 22:24 - What this means for LLM-as-a-judge
    Reading the internal judgment axis directly can beat the model's prompted output, and benchmark format differences may measure formatter geometry rather than evaluation quality.

Recommended Reading
  • Towards Automated Circuit Discovery for Mechanistic Interpretability - Introduces the circuit-discovery methodology that PEAP builds on, giving listeners the foundation for how researchers identify causally important sub-networks like the Latent Evaluator.
  • Attribution Patching: Activation Patching At Industrial Scale - The gradient-based attribution method that PEAP extends to edges; useful for understanding the one-forward-one-backward shortcut the episode describes for scoring millions of connections.
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - The canonical reference for the LLM-as-a-judge paradigm whose format-inconsistency problems this episode's paper goes inside the model to explain.
  • Do Llamas Work in English? On the Latent Language of Multilingual Transformers - A parallel 'compute abstractly, then translate' finding for multilingual models, supporting the episode's broader claim that judgment-vs-formatting is part of a recurring pattern in interpretability.
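
The "single direction" takeaway, and the idea of reading the judgment axis straight from activations, can be sketched on synthetic data. This is a minimal sketch under loud assumptions: the activations are random stand-ins rather than real model states, the hidden axis is planted by hand, and the difference-of-means probe is a common simple technique swapped in here, not necessarily the probe the paper uses.

```python
import random

# Synthetic stand-in for mid-layer activations; nothing here comes from a real model.
random.seed(0)
DIM = 8
true_axis = [1.0 if i == 0 else 0.0 for i in range(DIM)]  # hypothetical planted axis

def fake_activation(verdict: float) -> list:
    """Gaussian noise plus the verdict written along the planted axis."""
    return [random.gauss(0.0, 0.1) + verdict * a for a in true_axis]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

good = [fake_activation(+1.0) for _ in range(50)]  # examples judged favorably
bad = [fake_activation(-1.0) for _ in range(50)]   # examples judged unfavorably

# Difference-of-means direction (an assumption for this sketch), unit-normalized.
diff = [g - b for g, b in zip(mean(good), mean(bad))]
norm = dot(diff, diff) ** 0.5
direction = [d / norm for d in diff]

def read_verdict(activation):
    """Project an activation onto the estimated axis; the sign is the verdict."""
    return dot(activation, direction)

score_good = read_verdict(fake_activation(+1.0))
score_bad = read_verdict(fake_activation(-1.0))
print(score_good > 0, score_bad < 0)
```

The design choice worth noting is that the readout never passes through an output formatter: the scalar projection is the verdict, which is the sense in which a direct read could sidestep format-specific noise.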

AI Papers: A Deep Dive, by paperdive.ai