Best AI papers explained

How to Correctly Report LLM-as-a-Judge Evaluations

This paper introduces a statistical framework to address the noisy and biased accuracy estimates that arise when Large Language Models (LLMs) are used as judges. The raw proportion of correct judgments is unreliable because the judge has imperfect sensitivity and specificity, so the observed score is distorted in a way that depends on the true accuracy level. To correct for this, the authors develop a **simple plug-in bias-adjusted estimator** that removes the distortion by estimating the judge's error rates on a separate calibration dataset. The framework also provides a practical method for constructing **statistically sound confidence intervals** whose reported uncertainty reflects variance from both the main test set and the calibration sample. Finally, an **adaptive allocation algorithm** distributes the calibration budget to minimize confidence-interval length, increasing the overall reliability of LLM-based evaluations.
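
The excerpt does not give the paper's exact formulas, but the correction it describes matches the standard misclassification adjustment for an imperfect classifier. The sketch below is a minimal, hypothetical illustration of that idea: a plug-in estimator that inverts the judge's estimated sensitivity and specificity, plus a delta-method confidence interval pooling sampling variance from the test set and both calibration subsets. All names and numbers are assumptions for illustration; the paper's actual estimator and interval construction may differ, and the adaptive calibration-allocation step is omitted here.

```python
import math

def bias_adjusted_accuracy(p_obs, n_test,
                           sens_hat, n_cal_pos,
                           spec_hat, n_cal_neg,
                           z=1.96):
    """Plug-in bias-adjusted accuracy with a delta-method confidence interval.

    Assumes the judge's verdicts follow the standard misclassification model
        p_obs = sens * theta + (1 - spec) * (1 - theta),
    where theta is the true accuracy, sens/spec are the judge's sensitivity and
    specificity estimated on n_cal_pos / n_cal_neg calibration examples, and
    p_obs is the raw judge-approval rate on n_test test items.
    """
    denom = sens_hat + spec_hat - 1.0  # must be > 0 for an informative judge
    if denom <= 0:
        raise ValueError("Judge is uninformative: sensitivity + specificity <= 1")

    # Plug-in (Rogan-Gladen-style) corrected accuracy estimate.
    theta_raw = (p_obs + spec_hat - 1.0) / denom

    # Binomial sampling variances of the three estimated proportions.
    var_p = p_obs * (1 - p_obs) / n_test
    var_sens = sens_hat * (1 - sens_hat) / n_cal_pos
    var_spec = spec_hat * (1 - spec_hat) / n_cal_neg

    # Delta method: propagate all three variances through the correction formula.
    d_p = 1.0 / denom
    d_sens = -theta_raw / denom
    d_spec = (sens_hat - p_obs) / denom ** 2
    var_theta = d_p**2 * var_p + d_sens**2 * var_sens + d_spec**2 * var_spec

    half_width = z * math.sqrt(var_theta)
    theta_hat = min(max(theta_raw, 0.0), 1.0)
    return theta_hat, (max(theta_hat - half_width, 0.0),
                       min(theta_hat + half_width, 1.0))


# Example: raw judge-approval rate of 0.72 on 2,000 test items, with a judge
# whose sensitivity/specificity were estimated on 300 + 300 calibration items.
est, ci = bias_adjusted_accuracy(0.72, 2000, 0.90, 300, 0.85, 300)
print(f"bias-adjusted accuracy = {est:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval width depends on all three variance terms, one can see why an adaptive scheme helps: spending more calibration labels on whichever of sensitivity or specificity contributes most variance shrinks the interval fastest.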

Best AI papers explained, by Enoch H. Kang