AI Papers: A Deep Dive

When RL Actually Teaches Agents Something New, And When It Doesn't



Source: Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

Paper was published on April 16, 2026

This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A widely-cited result said reinforcement learning doesn't expand what language models can do — it just makes them more reliable at things they already half-knew. A new paper shows that conclusion was task-dependent, and builds a clean causal experiment to prove it: on multi-hop bridge questions, RL solves problems the base model can't solve at any sampling budget, while supervised fine-tuning on the same data actively makes things worse.

Key Takeaways
  • Why the pessimistic Yue et al. result on math reasoning — that RL and base models converge given enough tries — doesn't transfer to compositional agent tasks
  • The pass-at-k-T metric: a two-axis framework that separates 'more tries' from 'deeper interaction' and lets you measure capability boundaries directly
  • The headline asymmetry: on identical training data, RL gains four bridge problems net, while SFT loses four — and RL solves nine bridge problems SFT cannot, versus one the other way
  • Mechanism evidence that RL is reweighting the base model's existing strategies rather than teaching new ones — and that the novelty lives in reasoning over retrieved text, not in the search queries themselves
  • Why SFT on expert demonstrations collapsed strategy diversity threefold while RL preserved it, and what that implies for agent training pipelines
  • Honest limits: small scale (200 problems, 7B model), no temperature sweep, and modest absolute effect sizes that the narrative slightly oversells
    • 00:00 — The result that contradicts the prior consensus
      Setting up the headline finding — base and RL models tied at one shot, but the gap widens, not closes, as you give them more tries on bridge questions.
    • 02:51 — Efficiency versus capability, and why pass-at-k tests the difference
      The student-grade analogy that explains what unlimited-tries evaluation actually measures, and why Yue et al.'s convergence on math read as 'RL is just sampling efficiency.'
    • 05:43 — Why agent tasks break the re-sampling substitute
      Sequential dependence in bridge questions means parallel attempts can't recover what depth-of-interaction provides — motivating a two-axis metric, pass-at-k-T.
    • 08:35 — The causal experiment: SFT versus RL on identical data
      How training two model variants on the same 200 problems with different feedback signals isolates the learning signal as the cause of any divergence.
    • 11:27 — Three categories, three results
      Math as a negative control, modest gains on comparison questions, and the surprising bridge-question result where SFT regresses below the base model.
    • 14:19 — Reweighting, not replacement: what RL is actually doing
      Strategy diversity counts, a perplexity probe on queries versus reasoning, and the trajectory-novelty numbers that point to RL preserving the base distribution while SFT collapses it.
    • 17:11 — The skeptical case: scale, temperature, and effect sizes
      Where the paper's claims outrun its evidence — small training set, missing temperature sweep, wide confidence intervals, and the weakest leg of the mechanism story.
    • 20:02 — Reconciling the two findings, and what it means for building agents
      Why both the math-reasoning pessimism and the bridge-question optimism can be true under one mechanism, and the practical takeaway for anyone choosing between SFT and RL pipelines.
    • Recommended Reading
      • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? — The Yue et al. NeurIPS 2025 paper whose pessimistic pass-at-k convergence result on math reasoning is the foil this entire episode is structured against.
      • Evaluating Large Language Models Trained on Code — The original Codex paper, source of the unbiased pass-at-k estimator the authors extend into the two-axis pass-at-k-T framework discussed in the episode.
      • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Introduces GRPO, the specific RL algorithm used to produce the trained agent whose capability-expansion behavior on bridge questions drives the paper's headline result.
      • HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering — The benchmark whose comparison-versus-bridge question split the paper exploits to operationalize sequential dependence — essential context for why the bridge-question result matters.
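For listeners who want to see the metric in concrete terms, here is a minimal sketch of the unbiased pass-at-k estimator from the Codex paper, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c the number that were correct. The second function is only an assumed reading of the two-axis pass-at-k-T idea: it counts a sampled trajectory as correct if it solved the problem within T interaction turns. The names `pass_at_k_t` and `turns_to_solve`, and the choice to score by truncating recorded trajectories, are illustrative assumptions and may differ from the paper's exact definition.

```python
from math import comb
from typing import Optional, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_t(turns_to_solve: Sequence[Optional[int]], k: int, t: int) -> float:
    """Assumed two-axis variant: a sample counts only if solved within t turns.

    turns_to_solve holds one entry per sampled trajectory: the number of
    interaction turns it used to reach a correct answer, or None if it failed.
    (Hypothetical data layout, not taken from the paper.)
    """
    n = len(turns_to_solve)
    c = sum(1 for turns in turns_to_solve if turns is not None and turns <= t)
    return pass_at_k(n, c, k)


# Four sampled trajectories: solved in 2 turns, failed, solved in 5 turns, failed.
samples = [2, None, 5, None]
shallow = pass_at_k_t(samples, k=2, t=3)  # only the 2-turn solution counts
deep = pass_at_k_t(samples, k=2, t=5)     # both solutions count
```

Under this sketch, raising k models "more tries" and raising t models "deeper interaction", which are the two axes the episode separates.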

        AI Papers: A Deep Dive, by paperdive.ai