May 01, 2026

How to Pick the Best of Sixteen Coding Agent Rollouts

17 minutes

Source: Scaling Test-Time Compute for Agentic Coding

Paper was published on April 16, 2026

This episode was AI-generated on May 1, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

When an AI coding agent takes forty steps and tens of thousands of tokens to fix a single bug, running sixteen attempts in parallel is easy — picking the winner is the hard part. A new paper from Meta Superintelligence Labs argues the real bottleneck in agentic test-time scaling isn't compute, it's representation: you can't select what you can't compare, and you can't reuse what you can't summarize.

Key Takeaways

Why classic test-time scaling tricks like majority voting break down when the unit of work is a 40,000-token interactive session

How Recursive Tournament Voting uses pairwise bracket-style judging on compressed rollout summaries to pick a winner — and why pairwise beats flat ranking

The finding that the quality of the priors passed to a second wave of attempts almost entirely determines whether those attempts succeed

Concrete gains: 6–16 percentage points on SWE-Bench Verified and Terminal-Bench v2 across Claude and Gemini, plus a 3x drop in steps-per-attempt after refinement

Where the pipeline gets worse: refinement is a redistribution, not a strict improvement — more tasks become uniformly solvable, but more also become uniformly unsolvable

Why using the same model as both judge and generator is the pipeline's critical weakness, and why a dedicated trained judge is the obvious next step
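
For readers who want something concrete, the bracket-style selection mentioned in the takeaways can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the `judge` callable, its return convention (0 if the first summary wins, 1 otherwise), and the bye rule for odd-sized rounds are all assumptions made for this example. In practice the judge would be a model call comparing two rollout summaries; here a toy stand-in is used so the sketch runs on its own.

```python
from typing import Callable, List, Sequence

# Hypothetical judge signature (an assumption for this sketch):
# takes two rollout summaries, returns 0 if the first wins, 1 if the second wins.
Judge = Callable[[str, str], int]


def tournament_winner(summaries: Sequence[str], judge: Judge) -> int:
    """Return the index of the summary that survives a single-elimination bracket."""
    if not summaries:
        raise ValueError("need at least one summary")
    alive: List[int] = list(range(len(summaries)))
    while len(alive) > 1:
        next_round: List[int] = []
        # Pair up neighbours; each pair costs one pairwise judgment.
        for i in range(0, len(alive) - 1, 2):
            a, b = alive[i], alive[i + 1]
            pick = judge(summaries[a], summaries[b])
            next_round.append(a if pick == 0 else b)
        # Assumed bye rule: an unpaired last entry advances without a match.
        if len(alive) % 2 == 1:
            next_round.append(alive[-1])
        alive = next_round
    return alive[0]


def toy_judge(left: str, right: str) -> int:
    """Stand-in for a model judge: prefers the longer summary (illustration only)."""
    return 0 if len(left) >= len(right) else 1


if __name__ == "__main__":
    demo = ["a", "bbb", "cc", "dddd", "e"]
    print(tournament_winner(demo, toy_judge))
```

With sixteen summaries this bracket needs fifteen pairwise judgments, and each judgment only ever sees two summaries at a time rather than the full set.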

00:00 — Why voting fails for agentic rollouts
The framing problem: standard test-time scaling assumes outputs are small and clean, but agent rollouts are sprawling interactive sessions that can't be compared directly.

02:08 — Summarization as the load-bearing move
Why compressing each rollout into a structured 'lab notebook' summary is the prerequisite that makes every other step in the pipeline tractable.

04:16 — Recursive Tournament Voting explained
How a single-elimination bracket of pairwise judgments on summaries produces a winner, and why pairwise comparison beats asking the judge to rank everything at once.

06:24 — Parallel-Distill-Refine and the relay race
The second-wave mechanism: a fresh batch of sixteen attempts that each begin by reading the top four summaries from the first wave.

08:33 — The headline numbers and step efficiency
Accuracy gains across Claude and Gemini on SWE-Bench and Terminal-Bench, plus the surprising finding that refined attempts succeed in roughly a third as many steps.

10:41 — The context-quality finding that justifies the architecture
A near-deterministic relationship between how many of the four priors solved the task and whether the next attempt succeeds — which is what makes the tournament filter essential rather than decorative.

12:49 — Steelman: where the pipeline is fragile
The judge's correlated blind spots, the bimodal collapse on hard tasks, untested generalization beyond pass/fail coding benchmarks, and the unmeasured dependence on summary quality.

14:58 — Representation, not compute, as the new frontier
Why this paper functions less as a technique and more as a marker for a shift toward making sequences of attempts collectively smarter than any single one.
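
The second-wave step described in the 06:24 chapter can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function name `build_refine_prompts`, the prompt wording, and the idea that the first-wave summaries arrive already ranked best-first are all hypothetical and not taken from the paper. Only the numbers (sixteen attempts, four priors) come from the episode description.

```python
from typing import List, Sequence


def build_refine_prompts(
    task: str,
    ranked_summaries: Sequence[str],
    n_attempts: int = 16,
    n_priors: int = 4,
) -> List[str]:
    """Build one prompt per second-wave attempt, each prefixed with the top priors.

    Assumes `ranked_summaries` is ordered best-first by some earlier selection step.
    """
    priors = list(ranked_summaries[:n_priors])
    notes = "\n\n".join(
        f"Prior attempt {i + 1}:\n{summary}" for i, summary in enumerate(priors)
    )
    # Hypothetical prompt layout: task first, then the distilled notes.
    prompt = f"{task}\n\nNotes from earlier attempts:\n{notes}"
    # Every second-wave attempt starts from the same context.
    return [prompt for _ in range(n_attempts)]


if __name__ == "__main__":
    summaries = [f"summary {k}" for k in range(16)]
    prompts = build_refine_prompts("Fix the failing test.", summaries)
    print(len(prompts))
    print(prompts[0])
```

Because every second-wave attempt reads the same four summaries, whatever selects those four effectively sets the starting point for the whole batch, which is the dependence the 10:41 chapter describes.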