May 19, 2026

Why LLM Judges Flip Their Verdicts When You Change the Question Format

25 minutes

Source: Judge Circuits

The paper was published on May 15, 2026.

This episode was AI-generated on May 19, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Ask a language model to rate a text from 1 to 5 and it says four. Ask it a yes-or-no question about the same text and it says no. A new paper opens the hood and finds that the judgment itself is stable: what wobbles is a small piece of machinery near the output that translates an abstract verdict into whichever token the prompt demands. If the authors are right, much of what we call evaluator unreliability is actually a formatting artifact.

Key Takeaways

Why LLM judges produce inconsistent scores across prompt formats — and why the inconsistency lives in output routing, not evaluation quality

The 'transplant' experiment: copying activations from a rating prompt into a yes-no prompt flips the model's answer over 99% of the time on some models

Evidence that judgment is encoded along a single direction in activation space — a 'compass needle' that transfers across grammar, entailment, similarity, and preference tasks

A practical alternative to prompted scoring: read the judgment axis directly from mid-layer activations and bypass the noisy formatter

Where the clean modularity story breaks down — Gemma-3 at 12B entangles judgment with world knowledge in a way no other tested model does

Honest limits of the result: small per-cell sample sizes, probe design that partly presupposes a 1D encoding, and a universality claim the authors deliberately don't make
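
For listeners who want a feel for what "reading the judgment axis directly" could look like, here is a minimal sketch. It is not the paper's probe: it uses a simple difference-of-means direction, and the activations are synthetic stand-ins generated with NumPy rather than real mid-layer activations. All function names and parameters (judgment_direction, make_synthetic_activations, the 64-dimensional size, the signal strength) are assumptions made for illustration.

```python
import numpy as np

def judgment_direction(acts, labels):
    """Unit vector pointing from the mean 'bad' activation to the mean 'good' one."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return diff / np.linalg.norm(diff)

def judgment_scores(acts, direction):
    """Project each activation vector onto the direction: one scalar per example."""
    return np.asarray(acts, dtype=float) @ direction

def make_synthetic_activations(n=200, dim=64, seed=0):
    """Hypothetical data: Gaussian noise plus a +/- shift along one hidden axis."""
    rng = np.random.default_rng(seed)
    true_axis = rng.normal(size=dim)
    true_axis /= np.linalg.norm(true_axis)
    labels = rng.integers(0, 2, size=n).astype(bool)
    signal = np.where(labels, 3.0, -3.0)[:, None] * true_axis
    return rng.normal(size=(n, dim)) + signal, labels, true_axis

acts, labels, true_axis = make_synthetic_activations()
train_acts, train_labels = acts[:100], labels[:100]
test_acts, test_labels = acts[100:], labels[100:]

# Fit the direction on the first half, then pick a threshold halfway
# between the projected class means.
direction = judgment_direction(train_acts, train_labels)
train_scores = judgment_scores(train_acts, direction)
threshold = 0.5 * (train_scores[train_labels].mean() + train_scores[~train_labels].mean())

# Read the verdict on held-out examples without any prompted output format.
predicted = judgment_scores(test_acts, direction) > threshold
accuracy = float((predicted == test_labels).mean())
cosine_with_true_axis = float(direction @ true_axis)
```

On real models the activations would come from a chosen middle layer at a chosen token position, and whether one direction is enough is exactly the kind of assumption the episode flags as a limitation.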

00:00 — The format-inconsistency puzzle
Why the same model gives different verdicts under different output formats, and why prior work stopped at behavioral observation.

03:12 — Latent Evaluator and Task Formatters
The paper's central decomposition: a shared judgment core in the middle layers, and format-specific translators at the output.

04:07 — Format Transfer Injection
Transplanting activations from a rating run into a yes-no run flips the output — judgment is portable, formatting is local.

09:36 — Judgment as a single direction
Evidence that the model's verdict reduces to a scalar along one axis in activation space, transferable across tasks and steerable.

12:48 — How they found the circuit
Position-aware Edge Attribution Patching, contrastive circuit tracing, and independent SAE corroboration of the shared core.

16:00 — Where the modularity story breaks
Gemma-3 at 12B entangles judgment with world knowledge while its 27B sibling doesn't — modularity is architecture- and scale-dependent.

19:12 — Pressure points and limitations
Small sample sizes behind the headline numbers, probe-design assumptions, and the absence of a head-to-head comparison with an independent attribution method.

22:24 — What this means for LLM-as-a-judge
Reading the internal judgment axis directly can beat the model's prompted output, and benchmark format differences may measure formatter geometry rather than evaluation quality.
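
The transplant idea from the 04:07 chapter can be illustrated with a toy model. This is a sketch under heavy assumptions, not the paper's setup: the "model" is two hand-written stages in NumPy, a stand-in evaluator that writes a scalar verdict along one axis and two stand-in formatters that turn the same hidden state into either a 1-to-5 rating or a yes/no answer. Every name here (AXIS, evaluator, rating_formatter, yes_no_formatter, run) is hypothetical.

```python
import numpy as np

DIM = 8
AXIS = np.zeros(DIM)
AXIS[0] = 1.0  # assumed single "judgment" direction in the toy hidden space

def evaluator(text_quality):
    """Stand-in for the middle layers: store the verdict as a position along AXIS."""
    return text_quality * AXIS

def rating_formatter(hidden):
    """Stand-in output stage for a 1-to-5 rating prompt."""
    verdict = float(hidden @ AXIS)
    return int(np.clip(round(3 + 2 * verdict), 1, 5))

def yes_no_formatter(hidden):
    """Stand-in output stage for a yes/no prompt."""
    return "yes" if float(hidden @ AXIS) > 0 else "no"

def run(text_quality, formatter, patched_hidden=None):
    """Run the toy model, optionally overwriting the hidden state before formatting."""
    hidden = evaluator(text_quality) if patched_hidden is None else patched_hidden
    return formatter(hidden)

# Donor run: a good text under the rating format.
donor_hidden = evaluator(0.8)
donor_rating = run(0.8, rating_formatter)

# Recipient run: a poor text under the yes/no format, before and after the transplant.
clean_answer = run(-0.8, yes_no_formatter)
patched_answer = run(-0.8, yes_no_formatter, patched_hidden=donor_hidden)
```

In this toy the transplanted hidden state carries the donor's verdict into the yes/no formatter, so the answer follows the donor rather than the recipient's own text. Real experiments patch activations at specific layers and token positions inside a transformer, which this sketch does not attempt.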