


This research introduces a new method called Cascaded Selective Evaluation to improve the reliability of using large language models (LLMs) as judges for evaluating text generation. This approach uses a confidence estimation technique called Simulated Annotators to determine when an LLM's judgment is likely to align with human preferences. By selectively trusting LLMs based on their confidence and escalating to stronger models only when needed, the framework provides a provable guarantee of human agreement while also being more cost-effective than solely relying on the most powerful LLMs. Experimental results across different evaluation tasks demonstrate that this method achieves high human agreement with increased efficiency, even outperforming top-tier models in certain scenarios.
By Enoch H. Kang
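
To make the cascade-and-escalate logic concrete, below is a minimal Python sketch of the loop described in the summary. It is an illustration under stated assumptions, not the paper's implementation: the judge call, model names, annotator count, and confidence threshold are all hypothetical placeholders, and in the actual framework the threshold is calibrated on human-annotated data so the abstention policy yields the provable human-agreement guarantee.

```python
# Hypothetical sketch of cascaded selective evaluation with a
# simulated-annotators style confidence estimate. All names below
# (call_judge, JUDGE_CASCADE, model identifiers) are placeholders.

import random

# Cheaper judges come first; escalate to stronger ones only when needed.
JUDGE_CASCADE = ["small-judge-llm", "medium-judge-llm", "large-judge-llm"]


def call_judge(model: str, prompt: str, a: str, b: str, seed: int) -> str:
    """Placeholder for an LLM-as-judge call returning 'A' or 'B'.
    A real implementation would query `model` with a pairwise-comparison prompt."""
    random.seed(hash((model, prompt, seed)) % (2**32))
    return random.choice(["A", "B"])


def simulated_annotator_confidence(
    model: str, prompt: str, a: str, b: str, n_annotators: int = 5
) -> tuple[str, float]:
    """Estimate confidence by eliciting several simulated annotator judgments
    (here, just different seeds) and measuring how strongly they agree."""
    votes = [call_judge(model, prompt, a, b, seed=i) for i in range(n_annotators)]
    majority = max(set(votes), key=votes.count)
    confidence = votes.count(majority) / len(votes)
    return majority, confidence


def cascaded_selective_evaluation(
    prompt: str, a: str, b: str, threshold: float = 0.8
) -> tuple[str, str]:
    """Trust a judge only when its estimated confidence clears the threshold;
    otherwise escalate to the next, stronger judge in the cascade."""
    for model in JUDGE_CASCADE:
        verdict, confidence = simulated_annotator_confidence(model, prompt, a, b)
        if confidence >= threshold:
            return verdict, model  # accepted at this tier
    # No judge was confident enough: abstain (e.g. defer to human review).
    return "abstain", "human-review"


if __name__ == "__main__":
    verdict, decided_by = cascaded_selective_evaluation(
        "Summarize the article in one sentence.",
        "Summary A ...",
        "Summary B ...",
    )
    print(f"verdict={verdict}, decided_by={decided_by}")
```

Because most comparisons are resolved by the cheaper judges and only low-confidence cases escalate, this kind of cascade is what lets the method cost less than always querying the strongest model while still controlling agreement with human preferences.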