
This paper introduces a rigorous statistical framework for evaluating Large Language Models (LLMs) by treating the problem as a low-rank tensor completion task. The researchers address the challenges of chatbot leaderboards, such as those on platforms like Chatbot Arena, which rely on noisy and sparse human preference data from pairwise model comparisons. By assuming that model performance across various tasks and contexts is driven by a small number of latent factors, the authors demonstrate how to "borrow strength" across categories to improve accuracy. They develop semiparametric efficiency bounds and a debiased one-step estimator to provide reliable confidence intervals and uncertainty quantification for model rankings. To resolve technical bottlenecks caused by non-uniform sampling, they introduce a score-whitening method that stabilizes inference across heterogeneous matchups. Their findings offer a principled approach to constructing more robust, statistically sound leaderboards for the rapidly evolving field of AI evaluation.
By Enoch H. Kang