May 03, 2026

When RL Actually Teaches Agents Something New, And When It Doesn't

22 minutes

Source: Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

Paper was published on April 16, 2026

This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A widely cited result said reinforcement learning doesn't expand what language models can do; it just makes them more reliable at things they already half-knew. A new paper shows that conclusion was task-dependent, and builds a clean causal experiment to test it: on multi-hop bridge questions, RL solves problems the base model can't solve at any sampling budget tested, while supervised fine-tuning on the same data actively makes things worse.

Key Takeaways

Why the pessimistic Yue et al. result on math reasoning — that RL and base models converge given enough tries — doesn't transfer to compositional agent tasks

The pass-at-k-T metric: a two-axis framework that separates 'more tries' from 'deeper interaction' and lets you measure capability boundaries directly

The headline asymmetry: on identical training data, RL gains four bridge problems net, while SFT loses four — and RL solves nine bridge problems SFT cannot, versus one the other way

Mechanism evidence that RL is reweighting the base model's existing strategies rather than teaching new ones — and that the novelty lives in reasoning over retrieved text, not in the search queries themselves

Why SFT on expert demonstrations collapsed strategy diversity by a factor of three while RL preserved it, and what that implies for agent training pipelines

Honest limits: small scale (200 problems, 7B model), no temperature sweep, and modest absolute effect sizes that the narrative slightly oversells
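
For listeners who want to see the two-axis idea in concrete form, here is a minimal sketch of how a pass-at-k-T style score could be computed. The episode notes do not give the paper's exact definition, so this is an assumption: it applies the standard unbiased pass-at-k estimator from code-generation evaluation separately at each interaction budget T. The function names, the input layout, and the toy success counts are all hypothetical and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_T(correct_by_T: dict[int, list[int]], n: int, k: int, T: int) -> float:
    """Mean pass@k over problems, using only attempts run with turn budget T.

    correct_by_T maps an interaction budget T to a list of per-problem
    success counts (out of n attempts each). This layout is an assumption
    made for illustration.
    """
    counts = correct_by_T[T]
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)


# Toy, made-up counts for three problems, four attempts each.
toy = {
    1: [0, 2, 0],  # shallow interaction: one problem is ever solved
    4: [1, 2, 0],  # deeper interaction: a second problem becomes solvable
}

shallow = pass_at_k_T(toy, n=4, k=4, T=1)
deep = pass_at_k_T(toy, n=4, k=4, T=4)
```

In this toy setup, raising k at T=1 can never recover the first problem, because no attempt at that depth solves it; only raising T does. That is the distinction between "more tries" and "deeper interaction" that the metric is meant to separate.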

00:00 — The result that contradicts the prior consensus
Setting up the headline finding — base and RL models tied at one shot, but the gap widens, not closes, as you give them more tries on bridge questions.

02:51 — Efficiency versus capability, and why pass-at-k tests the difference
The student-grade analogy that explains what unlimited-tries evaluation actually measures, and why Yue et al.'s convergence on math read as 'RL is just sampling efficiency.'

05:43 — Why agent tasks break the re-sampling substitute
Sequential dependence in bridge questions means parallel attempts can't recover what depth-of-interaction provides — motivating a two-axis metric, pass-at-k-T.

08:35 — The causal experiment: SFT versus RL on identical data
How training two model variants on the same 200 problems with different feedback signals isolates the learning signal as the cause of any divergence.

11:27 — Three categories, three results
Math as a negative control, modest gains on comparison questions, and the surprising bridge-question result where SFT regresses below the base model.

14:19 — Reweighting, not replacement: what RL is actually doing
Strategy diversity counts, a perplexity probe on queries versus reasoning, and the trajectory-novelty numbers that point to RL preserving the base distribution while SFT collapses it.

17:11 — The skeptical case: scale, temperature, and effect sizes
Where the paper's claims outrun its evidence — small training set, missing temperature sweep, wide confidence intervals, and the weakest leg of the mechanism story.

20:02 — Reconciling the two findings, and what it means for building agents
Why both the math-reasoning pessimism and the bridge-question optimism can be true under one mechanism, and the practical takeaway for anyone choosing between SFT and RL pipelines.
