


This paper introduces Fully Unsupervised Score Ensembling (FUSE), a novel framework designed to improve the accuracy of large language model (LLM) outputs without requiring human-labeled data. By aggregating scores from multiple imperfect verifiers, FUSE identifies the most reliable responses during the inference process, a technique known as test-time scaling. The method addresses the limitations of traditional ensembling by mathematically adjusting for statistical dependencies between verifiers that typically hinder unsupervised performance. Experimental results demonstrate that FUSE frequently matches or exceeds the performance of semi-supervised models that have access to ground truth labels. This effectiveness is validated across diverse benchmarks, ranging from academic datasets like MMLU to highly difficult math and logic exams. Ultimately, FUSE offers a scalable, cost-effective solution for filtering synthetic data and enhancing model reliability in complex reasoning tasks.
By Enoch H. Kang
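
To make the selection step in the description concrete, here is a minimal sketch of label-free score ensembling for picking one response out of several candidates. It is not the FUSE algorithm: it is a naive baseline that z-normalizes each verifier's scores and averages them, and it does nothing to correct for dependencies between verifiers, which is the part the paper is said to address. The function names (`zscores`, `select_best`) and the example scores are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def zscores(values):
    """Rescale one verifier's scores to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A verifier that scores every candidate the same carries no signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def select_best(scores_by_verifier):
    """Pick a candidate by averaging normalized verifier scores.

    scores_by_verifier[v][c] is the score verifier v gave candidate c.
    Assumption: higher scores mean "more likely correct" for every verifier.
    No ground-truth labels are used anywhere.
    """
    normalized = [zscores(row) for row in scores_by_verifier]
    n_candidates = len(scores_by_verifier[0])
    combined = [
        mean(row[c] for row in normalized) for c in range(n_candidates)
    ]
    best = max(range(n_candidates), key=lambda c: combined[c])
    return best, combined


# Hypothetical scores: three verifiers on different scales, four candidates.
scores = [
    [0.2, 0.9, 0.4, 0.1],
    [10, 80, 30, 20],
    [0.5, 0.6, 0.7, 0.1],
]
best, combined = select_best(scores)
print("selected candidate index:", best)
```

Normalizing per verifier keeps a verifier with a wide numeric range from dominating the average. A plain average still counts two correlated verifiers twice, which is the kind of dependency problem the description attributes to traditional ensembling.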