AI Papers: A Deep Dive

What RL Actually Does to Language Models, at the Token Level


Source: Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

The paper was published on May 7, 2026.

This episode was AI-generated on May 8, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A new paper argues that reinforcement learning on math reasoning isn't teaching language models new tricks — it's editing one to three percent of tokens, all of which the base model was already considering. If that's right, the elaborate RL pipelines behind frontier reasoning models may be solving a much smaller problem than their cost suggests, and a $25 training run can match a $103,000 one.

Key Takeaways
  • RL-trained and base models agree on 97-99% of tokens; where they differ, the RL model's choice is almost always already in the base model's top five
  • Disagreements concentrate at high-entropy 'fork' positions, where the base model is uncertain and its entropy is 7-12x the average
  • A causal control (random substitution at the same positions) shows it's the specific token choices, not just the locations, that carry the benefit
  • ReasonMaxxer reproduces full RL accuracy using 50 problems, a tiny LoRA adapter, and a contrastive loss gated by base-model entropy — for around $25 on a 32B model
  • The mechanistic story is established only for math reasoning; pass-at-high-k isn't tested, and the cost comparisons partly rely on estimated baselines
  • The AlphaGo analogy for RL on LLMs is probably wrong: RL looks like calibration of an already-capable base model, not discovery of new strategies
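
The token-level measurements in the takeaways above can be illustrated with a small sketch. It assumes that "agreement" means the RL model's token equals the base model's most probable token, which may differ from the paper's exact definition. The function names and the toy distributions are hypothetical and stand in for real model outputs.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a {token: probability} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def top_k(dist, k):
    """The k most probable tokens under dist (fewer if dist is smaller)."""
    return sorted(dist, key=dist.get, reverse=True)[:k]

def compare(base_dists, rl_tokens, k=5):
    """Compare the base model's top choice with the RL model's token per position."""
    # Assumption: a "disagreement" is any position where the tokens differ.
    disagreements = [
        i for i, (dist, tok) in enumerate(zip(base_dists, rl_tokens))
        if top_k(dist, 1)[0] != tok
    ]
    entropies = [entropy(d) for d in base_dists]
    mean_entropy = sum(entropies) / len(entropies)
    in_top_k = [i for i in disagreements if rl_tokens[i] in top_k(base_dists[i], k)]
    if disagreements:
        containment = len(in_top_k) / len(disagreements)
        fork_entropy = sum(entropies[i] for i in disagreements) / len(disagreements)
        ratio = fork_entropy / mean_entropy
    else:
        containment, ratio = 1.0, float("nan")
    return {
        "agreement_rate": 1 - len(disagreements) / len(rl_tokens),
        "top_k_containment": containment,
        "entropy_ratio": ratio,
        "disagreement_positions": disagreements,
    }

# Toy data (hypothetical): three confident positions and one "fork".
base_dists = [
    {"The": 0.97, "A": 0.02, "An": 0.01},
    {"sum": 0.96, "total": 0.03, "value": 0.01},
    {"so": 0.40, "but": 0.35, "wait": 0.25},
    {"is": 0.98, "was": 0.01, "equals": 0.01},
]
rl_tokens = ["The", "sum", "wait", "is"]

report = compare(base_dists, rl_tokens)
print(report)
```

On real rollouts, the same three numbers (agreement rate, top-k containment, and the entropy ratio at disagreement positions) would be the ones to compare against the 97-99%, top-five, and 7-12x figures quoted above.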
Chapters
  • 00:00 - The $103,000 vs. $25 result
    Framing the four-thousand-fold cost gap between a standard RL pipeline and the paper's alternative on the same 32B model.
  • 02:58 - What people thought RL was doing
    The AlphaGo-style framing that justified large RL post-training budgets, and prior hints (Yue, Davis & Recht, Wang) that it might be wrong.
  • 05:56 - The token-level observation
    Base and RL models agree on 97-99% of tokens, disagree only on the base model's top alternatives, and only at high-entropy positions.
  • 08:54 - The oracle intervention and random control
    A surgical experiment showing that patching just the disagreement tokens recovers RL's accuracy, and that random substitutions at the same positions do not.
  • 11:52 - Locating the edits without a teacher
    Entropy alone, computed from the base model, identifies the consequential positions; a tiny LoRA captures the parameter footprint of the change.
  • 14:51 - ReasonMaxxer: the constructive method
    How 50 problems, base-model rollouts, an entropy gate, and a contrastive loss reproduce RL's gains for a few dollars on a single GPU.
  • 17:49 - Where the argument is and isn't tight
    Caveats on math-only evidence, missing pass-at-k comparisons, estimated baseline costs, and the indirect link between mechanism and method.
  • 20:47 - Calibration, not composition
    Why the findings reframe RL as fine-tuning a model already mostly in tune, and what that implies for where reasoning capability really comes from.
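
As a rough illustration of how an entropy gate and a contrastive loss might fit together, here is a minimal sketch for a single token position. These notes do not give the paper's actual loss, so the pairwise logistic form, the fixed threshold, and every name below are assumptions made for illustration, not ReasonMaxxer's implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_contrastive_loss(base_probs, policy_logprobs, good_token, bad_token, threshold):
    """Hypothetical per-position loss.

    base_probs:      frozen base model's next-token probabilities (list indexed by token id)
    policy_logprobs: trainable model's log-probabilities at the same position
    good_token:      token id taken by a rollout that reached a correct answer
    bad_token:       token id taken by a rollout that reached a wrong answer
    threshold:       entropy (nats) below which the position is ignored
    """
    # Gate: confident positions contribute nothing, so training only touches forks.
    if entropy(base_probs) < threshold:
        return 0.0
    # Pairwise logistic loss, -log(sigmoid(margin)): small when the policy
    # prefers the good token over the bad one by a wide margin.
    margin = policy_logprobs[good_token] - policy_logprobs[bad_token]
    return math.log1p(math.exp(-margin))

# Toy positions (hypothetical numbers).
confident = [0.98, 0.01, 0.01]
fork = [0.40, 0.35, 0.25]

loss_confident = gated_contrastive_loss(
    confident, [math.log(p) for p in confident], 0, 1, threshold=0.5)
loss_fork = gated_contrastive_loss(
    fork, [math.log(p) for p in fork], good_token=2, bad_token=0, threshold=0.5)
print(loss_confident, loss_fork)
```

In a real training setup the log-probabilities would come from the base model plus a LoRA adapter, and the loss would be averaged over gated positions and backpropagated through the adapter only. The sketch leaves that out to stay dependency-free.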
Recommended Reading
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
    The Yue et al. pass-at-k paper the episode cites as the original evidence that RL collapses probability mass onto solutions the base model already contains.
  • Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think (entropy and high-uncertainty token analysis in RL for reasoning)
    Connects to the episode's claim that RL's edits concentrate at high-entropy 'forking' tokens, the same signal ReasonMaxxer uses for gating.
  • LoRA: Low-Rank Adaptation of Large Language Models
    The parameter-efficient fine-tuning method underlying ReasonMaxxer's claim that RL's correction fits into a tiny low-rank patch on top of the base model.
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
    The flagship example of the expensive RL-for-reasoning pipeline whose necessity this episode's paper challenges.

AI Papers: A Deep Dive, by paperdive.ai