
This paper discusses the Distribution-Calibrated Aggregation scheme designed to improve the reliability of "Thinking-LLM-as-a-Judge" systems, which are often used for evaluating generative AI outputs. The core problem addressed is that simply aggregating multiple, noisy individual judgments (e.g., via majority vote) is suboptimal, especially when the judge is allowed to declare a tie. The proposed method utilizes Inference-Time Compute (ITC) to generate multiple independent samples and then models the three-way preference outcomes (A preferred, B preferred, or Tie) using a Bradley–Terry–Davidson formulation that accounts for both the margin of preference and the decisiveness of the vote (non-tie rate). Extensive experiments across machine translation and reward model benchmarks demonstrate that this distribution-aware aggregation consistently reduces the Mean Absolute Error (MAE) and increases accuracy, frequently matching or exceeding individual human rater performance. The authors emphasize that this calibration step is crucial for turning stochastic, individual LLM judgments into robust and accurate final ratings.
By Enoch H. Kang
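To make the aggregation idea concrete, here is a minimal sketch of fitting a Bradley–Terry–Davidson model to three-way vote counts from repeated judge samples. For a single item pair the multinomial MLE has a closed form: the preference margin is the log-odds of A over B, and the Davidson tie parameter captures how often the judge declines to decide. The function names, the add-half smoothing, and the decision threshold are illustrative assumptions, not the paper's exact procedure.

```python
import math

def btd_aggregate(n_a: int, n_b: int, n_t: int) -> tuple[float, float]:
    """Fit a single-pair Bradley-Terry-Davidson model to vote counts.

    n_a, n_b, n_t: counts of "A preferred", "B preferred", and "Tie"
    across independent judge samples. Returns (delta, nu), where
    delta is the log-odds preference margin for A over B and nu is
    the Davidson tie-propensity parameter.
    """
    n = n_a + n_b + n_t
    eps = 0.5  # add-half smoothing to avoid log(0); an illustrative choice
    p_a = (n_a + eps) / (n + 3 * eps)
    p_b = (n_b + eps) / (n + 3 * eps)
    p_t = (n_t + eps) / (n + 3 * eps)
    # Closed-form MLE for one pair: under Davidson's model,
    # P(tie) / sqrt(P(A) * P(B)) equals the tie parameter nu,
    # and log(P(A) / P(B)) equals the skill difference delta.
    delta = math.log(p_a / p_b)
    nu = p_t / math.sqrt(p_a * p_b)
    return delta, nu

def final_verdict(delta: float, margin: float = 0.2) -> str:
    """Map the calibrated margin to a verdict; threshold is hypothetical."""
    if delta > margin:
        return "A"
    if delta < -margin:
        return "B"
    return "Tie"
```

For example, 7 votes for A, 2 for B, and 1 tie yield a clearly positive margin and a low tie propensity, so the aggregate verdict is "A", whereas a 4/4/2 split would produce a margin near zero and a "Tie" verdict. The point of the calibration is that both the margin and the non-tie rate inform the final rating, rather than a bare majority vote.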