AI Papers: A Deep Dive

How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers



Source: SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

Paper was published on April 26, 2026

This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

An ETH Zurich group sat down to reproduce the hot new mixed-policy methods for training reasoning models — and found their plain SFT baseline beating the published baselines by five points. Pulling the thread led to two silent bugs in widely-used training libraries that had been deflating baselines across an entire subfield for over a year, and to an uncomfortable question about how much of recent benchmark progress is real.

Key Takeaways
  • How a misplaced branch in DeepSpeed's CPU-offloading code silently discarded most micro-batch gradients during accumulation, shrinking effective training signal without any warning
  • Why a 'mean-of-means' loss aggregation bug in OpenRLHF systematically mis-weights SFT updates when response lengths vary, and how it migrated from pretraining code where it was harmless
  • A clean four-number staircase that attributes a five-point baseline gap almost entirely to the optimizer bug, with the loss bug contributing under a point
  • Why corrected SFT-then-RL beats every published mixed-policy method on math benchmarks — by 3.8 points on Qwen and a striking 22 points on Llama — at roughly half the FLOPs
  • The structural lesson: when a whole subfield's baselines flow through the same library, independent replication becomes illusory, and framework diversity functions as epistemic insurance
  • Where the paper's claims have real edges — single-seed reproductions, math-only benchmarks, and the open question of whether mixed-policy methods could still help on top of a properly trained SFT model
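To make the 'mean-of-means' takeaway concrete, here is a minimal sketch in plain Python. It is not OpenRLHF code. The loss values are made up for illustration, and the "groups" stand in for whatever unit the library averages over (sequences, micro-batches, or ranks), which these notes do not specify.

```python
def token_mean(groups):
    """Token-weighted mean: total loss over all tokens / total token count."""
    total = sum(sum(g) for g in groups)
    count = sum(len(g) for g in groups)
    return total / count


def mean_of_means(groups):
    """Average each group's mean loss, then average those means equally."""
    means = [sum(g) / len(g) for g in groups]
    return sum(means) / len(means)


# Hypothetical per-token losses: a short response and a long one.
short_response = [2.0, 2.0]
long_response = [0.5] * 8
uneven = [short_response, long_response]

# With equal-length groups the two aggregations coincide.
even = [[2.0, 2.0], [0.5, 0.5]]

print(token_mean(uneven), mean_of_means(uneven))
print(token_mean(even), mean_of_means(even))
```

With uneven lengths, the mean of means gives the two-token response the same weight as the eight-token one, so each of its tokens counts four times as much. With equal-length groups the two aggregations agree, which is one plausible reading of why the pattern did no harm in pretraining code.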
    • 00:00 — The reproduction that wouldn't reproduce
      How an ETH group's two SFT baselines, run in different frameworks with identical settings, disagreed by five-and-a-half points and started the investigation.
    • 02:52 — The DeepSpeed gradient accumulation bug
      A misplaced conditional in CPU-offloading code meant only the first micro-batch's gradients were ever copied to the optimizer — silently, for over a year.
    • 05:45 — Why only the baselines were sick
      The asymmetry that made the bug invisible: mixed-policy methods ran on healthy verl/FSDP infrastructure while their SFT baselines ran through DeepSpeed.
    • 08:38 — The mean-of-means loss bug
      How OpenRLHF's distributed loss aggregation systematically mis-weights tokens when response lengths vary across batches, and how it leaked in from pretraining code where it was harmless.
    • 11:30 — The four-number staircase
      A controlled ablation that attributes the baseline gap to each bug individually and shows the patched pipeline matching an independently implemented clean baseline.
    • 14:23 — Corrected baselines flip the field's conclusions
      On Qwen and especially on Llama, a properly trained SFT-then-RL pipeline beats every published mixed-policy method, often dramatically and at lower compute cost.
    • 17:16 — Where the paper's claims have edges
      A steelman pass on dataset scope, single-seed reproductions, hyperparameter tuning, and what the paper does and does not rule out about mixed-policy methods.
    • 20:08 — The structural lesson about shared infrastructure
      Why concentrated tooling turns independent replications into a single point of failure, and what framework diversity buys a subfield epistemically.
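The gradient accumulation failure described in the 02:52 chapter can be illustrated with a toy model. This is a sketch under stated assumptions, not DeepSpeed code: the scalar model, the data, and the `copy_only_first` flag are all hypothetical, chosen only to show how dropping every micro-batch gradient after the first shrinks the update without raising an error.

```python
def micro_batch_grad(w, xs, ys):
    """Gradient of the mean of 0.5 * (w*x - y)^2 with respect to scalar w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulate(w, micro_batches, copy_only_first=False):
    """Average gradients over micro-batches into one optimizer buffer.

    copy_only_first=True mimics the described failure mode: later
    micro-batch gradients are computed but never reach the buffer.
    """
    buffer = 0.0
    n = len(micro_batches)
    for i, (xs, ys) in enumerate(micro_batches):
        grad = micro_batch_grad(w, xs, ys) / n
        if copy_only_first and i > 0:
            continue  # gradient silently dropped, no exception raised
        buffer += grad
    return buffer


# Hypothetical data following y = 2x, split into three micro-batches.
micro_batches = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([3.0], [6.0]),
    ([4.0], [8.0]),
]
w = 0.0

correct = accumulate(w, micro_batches)
buggy = accumulate(w, micro_batches, copy_only_first=True)
print(correct, buggy)
```

Both calls return a finite gradient pointing in the same direction, so nothing looks broken from the outside. The buggy version simply carries information from one micro-batch instead of three, which is why a failure like this can go unnoticed.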
    • Recommended Reading
      • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — The paper that established the SFT-then-RL recipe this episode defends, and the baseline against which mixed-policy methods positioned themselves.
      • LUFFY: Learning to Reason under Off-Policy Guidance — One of the mixed-policy methods whose published advantage over SFT-then-RL the episode argues was an artifact of buggy baselines.
      • The Unreasonable Effectiveness of Eccentric Automatic Prompts — A different flavor of the same lesson — apparent model 'limitations' often turn out to be artifacts of the surrounding pipeline rather than the model itself.

AI Papers: A Deep Dive, by paperdive.ai