The paper "To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning" explores the computational trade-offs between two primary strategies for scaling Large Language Model (LLM) test-time compute: sequential search (explicit backtracking within a chain-of-thought) and parallel sampling (generating multiple independent solutions and using best-of-n selection).
By evaluating these approaches on two strategic games, CountDown and Sudoku, the authors discovered that backtracking is not universally beneficial. Under a fixed compute budget, the backtracking model significantly underperformed parallel sampling on CountDown, but outperformed it on Sudoku.
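The contrast between the two strategies can be made concrete with a minimal sketch. The `generate` and `score` functions below are hypothetical stand-ins (not from the paper) for the model's sampler and a solution verifier; the point is only the control flow of best-of-n selection:

```python
import random

def generate(seed):
    """Hypothetical stand-in for sampling one independent solution."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(4)]

def score(solution):
    """Hypothetical verifier: higher is better."""
    return sum(solution)

def best_of_n(n):
    """Parallel sampling: draw n independent solutions, keep the best."""
    candidates = [generate(seed) for seed in range(n)]
    return max(candidates, key=score)

best = best_of_n(8)
```

Sequential search, by contrast, would spend the same budget on one long trace that revises itself, so the comparison under a fixed budget is n independent short attempts versus one long self-correcting attempt.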
The study identifies two main reasons why teaching a model to backtrack via supervised fine-tuning can inadvertently degrade its performance:
- Prescribed search bias: Training models on fixed backtracking traces forces them to imitate specific, often suboptimal search paths, preventing them from independently discovering more efficient strategies.
- Excessive verbosity: Explicit chain-of-thought supervision encourages models to generate lengthy but ineffective reasoning steps, while simultaneously discouraging internal "thinking" (implicit reasoning without verbalization).
The authors also highlight that task characteristics impact these results. Specifically, backtracking is more compute-efficient for tasks with deeper search trees, which explains why it succeeds in Sudoku (a deep trial-and-error game) but struggles against parallel sampling in CountDown (which has a shallower search tree).
Finally, the paper demonstrates how Reinforcement Learning (RL) via Group Relative Policy Optimization (GRPO) interacts differently with these two training strategies. When fine-tuned with RL, backtracking models gain the ability to discover novel, highly effective search strategies that surpass their original training data. In contrast, direct-solution models trained with RL improve their one-shot (pass@1) accuracy but lose their ability to generate diverse solutions, sharply limiting their effectiveness in parallel search scenarios.
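GRPO's core mechanism, the part relevant to the distinction above, is that it scores each sampled response relative to its own group rather than a learned value network. A minimal sketch of the group-relative advantage (the clipped policy-gradient loss built on top of it is omitted, and the 0/1 reward is an illustrative choice):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group Relative Policy Optimization baseline: normalize each
    response's reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of 4 sampled solutions to the same puzzle; reward = solved or not.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are zero-sum within a group, a direct-solution model is pushed hard toward whichever single answer style wins most often, which is consistent with the diversity collapse the paper reports.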