March 28, 2025

[Episode 179] s1: Simple test-time scaling

16 minutes

Seventy3: Turning papers into podcasts with NotebookLM, so everyone can keep learning alongside AI.

Today's topic: s1: Simple test-time scaling

Summary

This research explores test-time scaling, a technique that improves language model reasoning by spending extra computation during inference. The authors introduce s1K, a small, high-quality dataset of 1,000 reasoning problems, and budget forcing, a method for controlling how much computation the model spends at test time. By finetuning a language model on s1K and applying budget forcing, they achieve strong results on math reasoning benchmarks, surpassing previously reported methods while using significantly less training data. The work also compares approaches to test-time scaling and finds sequential methods such as budget forcing more effective than parallel ones such as majority voting. Overall, the study demonstrates a sample-efficient way to boost reasoning through test-time computation.
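To make budget forcing concrete, here is a minimal sketch of the idea as the paper describes it: if the model tries to stop thinking before a minimum budget is spent, the stop is suppressed and "Wait" is appended instead; once a maximum budget is reached, thinking is terminated. Everything named below is an assumption for illustration and not the authors' code: the `</think>` delimiter, the `budget_forced_thinking` function, its `min_tokens`/`max_tokens` parameters, and the `toy_model` stand-in that replaces a real language model.

```python
from typing import Callable, List

# Hypothetical token strings; a real model has its own end-of-thinking delimiter.
END_OF_THINKING = "</think>"
WAIT = "Wait"
QUESTION = "Q"


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Generate a thinking trace whose length is kept within a token budget."""
    thinking: List[str] = []
    while len(thinking) < max_tokens:
        token = next_token(prompt + thinking)
        if token == END_OF_THINKING:
            if len(thinking) >= min_tokens:
                break  # the model stopped on its own, within budget
            # Too early to stop: suppress the delimiter and nudge the model on.
            token = WAIT
        thinking.append(token)
    # Reached either by the model stopping or by hitting the maximum budget.
    return thinking + [END_OF_THINKING]


def toy_model(context: List[str]) -> str:
    """Stand-in for a language model: tries to stop after two steps
    since the question or the most recent "Wait"."""
    steps = 0
    for tok in reversed(context):
        if tok in (WAIT, QUESTION):
            break
        steps += 1
    return END_OF_THINKING if steps >= 2 else "step"


if __name__ == "__main__":
    # Lengthened: the early stop is replaced by "Wait".
    print(budget_forced_thinking(toy_model, [QUESTION], min_tokens=5, max_tokens=8))
    # Shortened: thinking is cut off at the maximum budget.
    print(budget_forced_thinking(toy_model, [QUESTION], min_tokens=0, max_tokens=1))
```

This is a sequential form of scaling because the extra computation extends a single reasoning trace, whereas a parallel method such as majority voting samples several independent traces and picks the most common answer.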

Paper link: https://arxiv.org/abs/2501.19393
