AI Papers: A Deep Dive

How a 30B Open Model Reached Olympiad Gold With the Right Recipe


Source: Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Paper was published on May 13, 2026

This episode was AI-generated on May 16, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A thirty-billion-parameter open-source model just matched the top human score on the 2026 USAMO — and the full training recipe is public. The result suggests that olympiad-grade proof reasoning, long assumed to require trillion-parameter frontier systems, may have been more about training procedure than raw scale.

Key Takeaways
  • Why proof-writing and answer-finding are fundamentally different skills, and how models can score 95% on answer-based math but 20% on proof benchmarks
  • The reverse-perplexity curriculum: feeding the model its most surprising training examples first, and why it beats both random and easy-first ordering
  • How a two-stage RL progression — cheap verifiable rewards, then expensive proof-quality rewards — extracts more capability than either alone
  • The test-time scaling loop where the model writes ~100,000-token proofs, critiques them, and iterates up to 30 times per attempt
  • The honest asterisks: human-vs-model comparison conditions, grading regime differences, and the substantial inference compute the recipe requires
  • Where SU-01 is genuinely strong (formally tractable problems) versus where it still fails (global combinatorial structure and delicate invariants)
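The reverse-perplexity ordering in the takeaways above can be illustrated with a minimal sketch. It assumes each training example already carries per-token natural-log probabilities from some reference model; the field name `token_logprobs`, the helper names, and the toy data are hypothetical, and these notes do not say which model or granularity the paper uses for scoring.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one example from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("example has no tokens")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def reverse_perplexity_order(examples: Sequence[dict]) -> list[dict]:
    """Sort examples so the most surprising (highest-perplexity) ones come first."""
    return sorted(
        examples,
        key=lambda ex: perplexity(ex["token_logprobs"]),
        reverse=True,
    )


# Toy data: hypothetical log-probs a reference model assigned to three examples.
toy_examples = [
    {"id": "routine", "token_logprobs": [-0.1, -0.2, -0.1]},
    {"id": "surprising", "token_logprobs": [-2.0, -3.0, -2.5]},
    {"id": "moderate", "token_logprobs": [-0.9, -1.1, -1.0]},
]

ordered_ids = [ex["id"] for ex in reverse_perplexity_order(toy_examples)]
print(ordered_ids)  # ['surprising', 'moderate', 'routine']
```

An easy-first curriculum would be the same sort with `reverse=False`, which is the ordering the episode says this approach outperforms.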
    • 00:00 — The headline result and why it's different
      SU-01 matches the top human USAMO 2026 score using an open 30B mixture-of-experts model with a fully documented recipe.
    • 03:27 — Answer-finding versus proof-writing
      Why olympiad problems expose a hidden ceiling that answer-based benchmarks miss, and how the paper frames the core training challenge.
    • 06:55 — Stage one: reverse-perplexity curriculum
      The counterintuitive idea of training on the most surprising examples first, with ablations showing it nearly doubles performance over easy-first ordering.
    • 10:23 — Stage two: coarse RL on verifiable answers
      Why sequence-level GSPO works for MoE models where standard token-level GRPO breaks down, and how this stage halves the gap to frontier performance.
    • 13:51 — Stage three: refined RL with a proof-grading judge
      Moving from answer-correctness to proof-validity rewards, with self-refinement on failures and a low-entropy experience replay buffer for rare successes.
    • 17:19 — Stage four: test-time scaling and 100,000-token proofs
      The inference-time loop of drafting, self-critiquing, and iteratively repairing proofs that produces the gold-medal-level scores.
    • 20:46 — The steelman: asterisks on the headline numbers
      Honest accounting of grading regime differences, inference compute costs, and what the human-comparison framing does and doesn't support.
    • 24:14 — Where the model is strong and where it fails
      A capability map showing SU-01 excels at problems with rigid formal structure but struggles with global combinatorial arguments — including the elegant complex-number solution to USAMO Problem 3.
    • 27:42 — Specializable generalists and what's actually new
      The paper's broader claim that the training recipe, rather than scale alone, is doing much of the work, and which methodological contributions transfer beyond olympiad math.
    • Recommended Reading
      • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Introduces GRPO, the token-level policy optimization algorithm that SU-01's authors deliberately swap out for the sequence-level GSPO to handle their MoE backbone.
      • Solving Olympiad Geometry Without Human Demonstrations (AlphaGeometry) — The DeepMind approach this episode uses as the contrast case: a bespoke neuro-symbolic olympiad solver, the opposite philosophy to SU-01's 'specializable-generalist' recipe.
      • Let's Verify Step by Step — OpenAI's process-reward work, which shows why grading reasoning step by step (the spirit of SU-01's refined RL stage) can outperform reward signals based only on final-answer correctness.
      • Self-Refine: Iterative Refinement with Self-Feedback — An early articulation of the draft–critique–repair loop that SU-01 scales up dramatically in its 100k-token test-time inference procedure.

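The draft, critique, and repair loop described in the test-time scaling chapter can be sketched roughly as follows. This is only an illustration of the control flow, not the paper's implementation: the `draft`, `critique`, and `repair` callables are hypothetical stand-ins for model calls, and the only number taken from the notes is the cap of 30 iterations per attempt.

```python
from typing import Callable, Optional

MAX_ROUNDS = 30  # the episode notes cite up to 30 iterations per attempt


def refine_proof(
    problem: str,
    draft: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    repair: Callable[[str, str, str], str],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[str, int, bool]:
    """Return (proof, critique_rounds_used, accepted).

    critique returns None when it finds no flaw, otherwise a description
    of the flaw that repair should address.
    """
    proof = draft(problem)
    for round_index in range(1, max_rounds + 1):
        issue = critique(problem, proof)
        if issue is None:
            return proof, round_index, True
        proof = repair(problem, proof, issue)
    # Budget exhausted: the last repaired proof was never re-checked.
    return proof, max_rounds, False


# Toy stand-ins so the sketch runs without a model (all hypothetical).
def toy_draft(problem: str) -> str:
    return "claim"


def toy_critique(problem: str, proof: str) -> Optional[str]:
    return None if proof.count("+lemma") >= 2 else "missing lemma"


def toy_repair(problem: str, proof: str, issue: str) -> str:
    return proof + "+lemma"


proof, rounds, accepted = refine_proof("toy problem", toy_draft, toy_critique, toy_repair)
print(proof, rounds, accepted)  # claim+lemma+lemma 3 True
```

In a real system each callable would be a long model generation, which is where the substantial inference compute mentioned in the takeaways comes from.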
AI Papers: A Deep Dive, by paperdive.ai