May 09, 2026

What RL Actually Does to Language Models, at the Token Level

23 minutes

Source: Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

Paper was published on May 07, 2026

This episode was AI-generated on May 8, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new paper argues that reinforcement learning on math reasoning isn't teaching language models new tricks: it's editing one to three percent of tokens, nearly all of which the base model was already considering. If that's right, the elaborate RL pipelines behind frontier reasoning models may be solving a much smaller problem than their cost suggests, and a $25 training run can match a $103,000 one.

Key Takeaways

RL-trained and base models agree on 97-99% of tokens; where they differ, the RL model's choice is almost always already in the base model's top five

Disagreements concentrate at high-entropy 'fork' positions, moments where the base model is uncertain and its entropy runs 7-12x the average

A causal control (random substitution at the same positions) shows it's the specific token choices, not just the locations, that carry the benefit

ReasonMaxxer reproduces full RL accuracy using 50 problems, a tiny LoRA adapter, and a contrastive loss gated by base-model entropy — for around $25 on a 32B model

The mechanistic story is established only for math reasoning; pass-at-k at high k isn't tested, and the cost comparisons partly rely on estimated baselines

The AlphaGo analogy for RL on LLMs is probably wrong: RL looks like calibration of an already-capable base model, not discovery of new strategies
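
For readers who want to see what the first two takeaways measure, here is a minimal sketch in Python. It is not the paper's code. It assumes the base model's next-token distribution at each position is available as a dict of token to probability, and that the RL model's chosen token was recorded at the same positions. The function names and the toy data are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution (token -> prob)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def compare_tokens(base_dists, rl_tokens, k=5):
    """Compare RL token choices against the base model, position by position.

    base_dists: list of dicts, the base model's distribution at each position
    rl_tokens:  list of tokens the RL model chose at the same positions
    Returns (agreement rate, share of disagreements inside the base top-k,
             mean entropy at disagreements divided by mean entropy overall).
    """
    entropies = [entropy(d) for d in base_dists]
    disagree = []      # positions where the RL token is not the base argmax
    in_top_k = 0       # disagreements where the RL token is in the base top-k
    for i, (dist, rl_tok) in enumerate(zip(base_dists, rl_tokens)):
        ranked = sorted(dist, key=dist.get, reverse=True)
        if rl_tok != ranked[0]:
            disagree.append(i)
            if rl_tok in ranked[:k]:
                in_top_k += 1
    n = len(rl_tokens)
    agreement = (n - len(disagree)) / n
    if not disagree:
        return agreement, None, None
    top_k_share = in_top_k / len(disagree)
    mean_all = sum(entropies) / n
    mean_disagree = sum(entropies[i] for i in disagree) / len(disagree)
    return agreement, top_k_share, mean_disagree / mean_all

# Toy example (made-up numbers): one uncertain "fork" among confident positions.
base_dists = [
    {"the": 0.97, "a": 0.03},
    {"so": 0.40, "but": 0.35, "wait": 0.25},   # high-entropy fork
    {"=": 0.99, "+": 0.01},
    {"2": 0.98, "3": 0.02},
]
rl_tokens = ["the", "wait", "=", "2"]
agreement, top_k_share, entropy_ratio = compare_tokens(base_dists, rl_tokens)
print(agreement, top_k_share, entropy_ratio)
```

In this toy case the only disagreement sits at the uncertain position, so the entropy ratio comes out above one. The 97-99% and 7-12x figures above are the paper's measurements on real models, not something this sketch reproduces.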

00:00 — The $103,000 vs. $25 result
Framing the roughly four-thousand-fold cost gap ($103,000 vs. $25) between a standard RL pipeline and the paper's alternative on the same 32B model.

02:58 — What people thought RL was doing
The AlphaGo-style framing that justified large RL post-training budgets, and prior hints (Yue; Davis & Recht; Wang) that it might be wrong.

05:56 — The token-level observation
Base and RL models agree on 97-99% of tokens, disagree only on the base model's top alternatives, and only at high-entropy positions.

08:54 — The oracle intervention and random control
A surgical experiment showing that patching just the disagreement tokens recovers RL's accuracy — and that random substitutions at the same positions don't.

11:52 — Locating the edits without a teacher
Entropy alone, computed from the base model, identifies the consequential positions; a tiny LoRA captures the parameter footprint of the change.

14:51 — ReasonMaxxer: the constructive method
How 50 problems, base-model rollouts, an entropy gate, and a contrastive loss reproduce RL's gains for a few dollars on a single GPU.

17:49 — Where the argument is and isn't tight
Caveats on math-only evidence, missing pass-at-k comparisons, estimated baseline costs, and the indirect link between mechanism and method.

20:47 — Calibration, not composition
Why the findings reframe RL as fine-tuning a model already mostly in tune, and what that implies for where reasoning capability really comes from.
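
The "entropy gate" mentioned in the chapters above can be pictured with a short Python sketch. It is a rough illustration under stated assumptions, not the ReasonMaxxer implementation. It assumes a hard threshold on base-model entropy (the threshold value is hypothetical), and it uses placeholder per-token loss values rather than the paper's contrastive loss, whose exact form is not given in these notes.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability list."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_gate(base_probs, threshold):
    """Return 1.0 at positions where base-model entropy exceeds the threshold,
    else 0.0. The hard cutoff and the threshold value are assumptions."""
    return [1.0 if token_entropy(p) > threshold else 0.0 for p in base_probs]

def gated_mean(per_token_loss, gate):
    """Average the per-token loss over gated positions only."""
    active = sum(gate)
    if active == 0:
        return 0.0
    return sum(l * g for l, g in zip(per_token_loss, gate)) / active

# Toy example (made-up numbers): only the middle position is uncertain.
base_probs = [[0.97, 0.03], [0.40, 0.35, 0.25], [0.99, 0.01]]
gate = entropy_gate(base_probs, threshold=0.5)   # hypothetical threshold
placeholder_loss = [1.0, 2.0, 3.0]               # stand-in for a real loss
print(gate, gated_mean(placeholder_loss, gate))
```

The point of the gate is that training signal flows only through the few positions where the base model is uncertain, which matches the claim that the consequential edits can be located from base-model entropy alone.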