Best AI papers explained

Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge



This paper discusses the Distribution-Calibrated Aggregation scheme designed to improve the reliability of "Thinking-LLM-as-a-Judge" systems, which are often used for evaluating generative AI outputs. The core problem addressed is that simply aggregating multiple, noisy individual judgments (e.g., via majority vote) is suboptimal, especially when the judge is allowed to declare a tie. The proposed method utilizes Inference-Time Compute (ITC) to generate multiple independent samples and then models the three-way preference outcomes (A preferred, B preferred, or Tie) using a Bradley–Terry–Davidson formulation that accounts for both the margin of preference and the decisiveness of the vote (non-tie rate). Extensive experiments across machine translation and reward model benchmarks demonstrate that this distribution-aware aggregation consistently reduces the Mean Absolute Error (MAE) and increases accuracy, frequently matching or exceeding individual human rater performance. The authors emphasize that this calibration step is crucial for turning stochastic, individual LLM judgments into robust and accurate final ratings.
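To make the aggregation step concrete, below is a minimal sketch of how repeated three-way verdicts from a judge could be turned into a calibrated preference under the standard Bradley–Terry–Davidson parameterization for a single item pair. The function name `btd_aggregate`, the Laplace smoothing, and the closed-form single-pair fit are illustrative assumptions, not the authors' implementation; the paper's exact scoring and thresholding details are not reproduced here.

```python
from collections import Counter
from math import exp, log

def btd_aggregate(verdicts, alpha=1.0):
    """Aggregate repeated three-way judge verdicts ("A", "B", "tie")
    under a single-pair Bradley-Terry-Davidson model.

    Returns the calibrated log-strength margin delta = log(pi_A / pi_B),
    the tie parameter nu (capturing vote decisiveness), and the implied
    outcome probabilities. alpha is a Laplace-smoothing pseudo-count that
    keeps the estimates finite when a category is never observed.
    """
    counts = Counter(verdicts)
    n_a = counts.get("A", 0) + alpha
    n_b = counts.get("B", 0) + alpha
    n_t = counts.get("tie", 0) + alpha
    n = n_a + n_b + n_t

    # Single-pair maximum likelihood: with two free parameters and three
    # outcome categories the model is saturated, so matching the model
    # probabilities to the empirical frequencies gives delta and nu directly.
    delta = log(n_a / n_b)                      # preference margin
    base = exp(delta / 2) + exp(-delta / 2)     # pi_A + pi_B with pi_A * pi_B = 1
    nu = (n_t / n) * base / (1.0 - n_t / n)     # tie propensity (non-tie rate)

    denom = base + nu
    return {
        "delta": delta,
        "nu": nu,
        "p_A": exp(delta / 2) / denom,
        "p_B": exp(-delta / 2) / denom,
        "p_tie": nu / denom,
    }

# Example: 16 independent inference-time samples on one (A, B) pair.
samples = ["A"] * 9 + ["B"] * 3 + ["tie"] * 4
print(btd_aggregate(samples))  # positive delta -> A preferred; p_tie reflects decisiveness
```

In this sketch a positive delta indicates a calibrated preference for A, while nu separates a decisive 9-3 split from one diluted by many ties, which is the distinction a plain majority vote discards.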


Best AI papers explained, by Enoch H. Kang