The provided sources discuss advancements in large language models (LLMs), focusing on test-time compute scaling to improve reasoning performance. One paper introduces s1-32B, an open-source model trained on a small, curated dataset of 1,000 reasoning problems, together with a technique called budget forcing, which controls how long the model "thinks" at inference time to improve accuracy on complex tasks such as mathematical problem-solving. The other source is a figure illustrating beam search, a common LLM inference technique.
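As a rough illustration of the budget-forcing idea described in the s1 paper, here is a minimal Python sketch: the end-of-thinking delimiter is suppressed (by appending "Wait") until a minimum token budget is spent, and generation is cut off once a maximum budget is reached. The delimiter string and the `generate_tokens` stub are illustrative assumptions, not the paper's actual implementation; a real model's decoding loop would take their place.

```python
from typing import Iterator, List

END_OF_THINKING = "</think>"  # assumed delimiter; real models vary


def generate_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical token stream; replace with a real model's decoder."""
    for tok in ["Let's", "see", "...", END_OF_THINKING, "answer:", "42"]:
        yield tok


def budget_forced_decode(prompt: str, min_tokens: int, max_tokens: int) -> List[str]:
    out: List[str] = []
    for tok in generate_tokens(prompt):
        if tok == END_OF_THINKING and len(out) < min_tokens:
            # Too early to stop thinking: suppress the delimiter and
            # append "Wait" so the model keeps reasoning (s1's trick).
            out.append("Wait")
            continue
        out.append(tok)
        if len(out) >= max_tokens:
            # Budget exhausted: force the thinking phase to end.
            out.append(END_OF_THINKING)
            break
        if tok == END_OF_THINKING:
            break
    return out


if __name__ == "__main__":
    print(budget_forced_decode("2+2*20=?", min_tokens=8, max_tokens=32))
```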
Two research papers are reviewed:
1) https://arxiv.org/pdf/2408.03314 - 2024 - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
2) https://arxiv.org/pdf/2501.19393 - 2025 - s1: Simple test-time scaling
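For context on the beam search figure mentioned above, here is a minimal, self-contained sketch of the technique. The toy `log_probs` scoring table is invented for illustration; in LLM inference, a real model would supply the next-token log-probabilities.

```python
import math
from typing import Dict, List, Tuple


def log_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token distribution; replace with a model call."""
    if prefix and prefix[-1] == "a":
        return {"a": math.log(0.2), "b": math.log(0.5), "<eos>": math.log(0.3)}
    return {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}


def beam_search(beam_width: int = 2, max_len: int = 4) -> List[Tuple[Tuple[str, ...], float]]:
    # Each hypothesis is (token sequence, cumulative log-probability).
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished; carry forward
                continue
            for tok, lp in log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the top `beam_width` hypotheses by total log-prob.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams


if __name__ == "__main__":
    for seq, score in beam_search():
        print(" ".join(seq), f"(logp={score:.2f})")
```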
By mcgrof