May 03, 2026

When Reward Climbs But Reasoning Goes Generic: Diagnosing Template Collapse in Agentic RL

22 minutes

Source: RAGEN-2: Reasoning Collapse in Agentic RL

Paper was published on April 07, 2026

This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A new paper argues that the standard health metric for RL training of language models — entropy — can't see one of its most damaging failure modes. Models can produce fluent, varied-looking reasoning that has quietly stopped depending on the input at all, and the field's go-to dial points the wrong direction. The fix is a one-line change to the training loop that, on average, uses less compute and gets better results.

Key Takeaways

Why entropy conflates two independent axes of diversity — variation within a prompt and dependence on the prompt — and how that lets 'template collapse' run undetected

How a Shannon chain-rule decomposition turns the missing axis into a measurable quantity, and how 'cross-scoring' rollouts against other prompts in the batch makes it concrete

The Cauchy-Schwarz bound that caps the magnitude of the task gradient at a multiple of the square root of reward variance, so that on low-variance prompts the regularizers dominate the update

Why simply filtering out low-reward-variance prompts produced a 16-point absolute gain on Sokoban with PPO while cutting per-step compute by 26-41%

Where the method's gains are uneven, where the mutual-information proxy may be miscalibrated, and why the filter could risk a slow-motion exploration collapse

Why this reframes KL penalties and entropy bonuses as regulating the wrong axis: they control noise instead of amplifying weak task signal
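
For readers who want the two mathematical claims in the takeaways written out, here is a minimal sketch. The symbols are assumptions made for illustration (X for the prompt, Z for the sampled reasoning trace, R for the reward, and a per-prompt mean-reward baseline), and the paper's own notation and exact statement of the bound may differ. The first line is the standard Shannon chain rule; the second applies Cauchy-Schwarz to a generic REINFORCE-style gradient, which is one common way such a bound arises.

```latex
% Chain rule: total trace entropy = within-prompt variation + dependence on the prompt.
H(Z) = H(Z \mid X) + I(X; Z)

% Cauchy-Schwarz on a policy-gradient term for a fixed prompt x,
% with baseline \bar{R} = \mathbb{E}[R \mid x] (assumed form, not quoted from the paper):
\left\| \mathbb{E}\!\left[ (R - \bar{R}) \, \nabla_\theta \log \pi_\theta(Z \mid x) \right] \right\|
  \;\le\; \sqrt{\operatorname{Var}(R \mid x)} \;
          \sqrt{\mathbb{E}\!\left[ \left\| \nabla_\theta \log \pi_\theta(Z \mid x) \right\|^{2} \right]}
```

Read this way, a metric that tracks only H(Z | X) can stay high while I(X; Z) shrinks toward zero, and a prompt whose rollouts all earn nearly the same reward contributes almost no task gradient regardless of how large the score-function term is.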

00:00 — A failure mode the metrics can't see
Setting up the puzzle: reward and entropy look healthy while every chain of thought has silently converged to the same skeleton.

02:48 — Two axes of diversity, not one
How Shannon's chain rule splits entropy into within-input variation and cross-input mutual information, and why standard metrics only see the first.

05:36 — Cross-scoring and retrieval accuracy
The diagnostic that asks whether a reasoning trace 'knows' which prompt it came from, and what happens when retrieval drops to chance.

08:24 — Why entropy points the wrong way
The empirical reversal: the mutual-information proxy correlates positively with task performance while entropy correlates negatively.

11:13 — The mechanism: low signal, fixed noise
How a Cauchy-Schwarz bound on task gradients combined with input-agnostic regularizers explains why low reward variance pulls the model toward generic patterns.

14:01 — The fix: filter on reward variance
SNR-aware filtering and the quartile ablation that turns the correlation between variance and performance into a causal claim.

16:49 — Where the result holds up and where it doesn't
Pushback on correlation magnitudes, uneven gains across settings, self-likelihood limits of the proxy, and the question of long-horizon exploration.

19:37 — What this changes about RL training
Reframing KL penalties and entropy bonuses, the connection to model-collapse literature, and the practical takeaways for practitioners.
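
The two practical ideas discussed in the episode, the cross-scoring retrieval check and the reward-variance filter, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the function names, the layout of the score matrix, and the use of a fixed variance threshold are all hypothetical choices made here, and the paper's actual filtering criterion may be defined differently.

```python
from statistics import pvariance


def retrieval_accuracy(scores):
    """Fraction of traces that score highest under their own prompt.

    Assumed layout: scores[i][j] is the log-likelihood of the trace sampled
    for prompt i when it is re-scored as if it had been generated for prompt j.
    With n prompts in the batch, chance level is 1 / n.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Index of the prompt under which this trace is most likely.
        best = max(range(len(row)), key=lambda j: row[j])
        hits += int(best == i)
    return hits / len(scores)


def filter_by_reward_variance(rewards_by_prompt, min_variance):
    """Keep prompt ids whose rollout rewards vary by more than min_variance.

    The fixed threshold is an assumption for illustration only.
    """
    return [
        prompt_id
        for prompt_id, rewards in rewards_by_prompt.items()
        if pvariance(rewards) > min_variance
    ]


# Toy data, invented for this example.
scores = [
    [-1.0, -5.0, -4.0],  # trace 0 is most likely under prompt 0 (hit)
    [-3.0, -3.5, -2.0],  # trace 1 is most likely under prompt 2 (miss)
    [-6.0, -4.0, -1.5],  # trace 2 is most likely under prompt 2 (hit)
]
rewards_by_prompt = {
    "p0": [0, 0, 0, 0],  # always fails: zero variance, no task signal
    "p1": [1, 0, 1, 0],  # mixed outcomes: informative
    "p2": [1, 1, 1, 1],  # always succeeds: zero variance, no task signal
}

accuracy = retrieval_accuracy(scores)
kept = filter_by_reward_variance(rewards_by_prompt, min_variance=0.0)
```

On this toy batch two of the three traces are retrieved correctly, and only the prompt with mixed outcomes survives the filter. In a real training loop the score matrix would come from re-scoring each rollout under every other prompt in the batch, which is the expensive part this sketch leaves out.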