AI Papers: A Deep Dive

Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math


Source: RMA: an Agentic System for Research-Level Mathematical Problems

Paper was published on May 20, 2026

This episode was AI-generated on May 25, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A university research group just outperformed OpenAI's and DeepMind's flagship math systems on a benchmark of problems contributed by working mathematicians, using a base model that solves none of those problems on its own. The trick wasn't a bigger model. It was decomposing the work of a mathematician into specialized agents sharing a structured whiteboard, and the implications for AI progress reach well beyond math.

Key Takeaways
  • How RMA, built on Claude Opus 4.6, solves 8 of 10 First Proof problems while the same model with no scaffolding solves 0
  • The seven-agent setup — initializer, three proposers, three verifiers — and why an append-only shared memory is what actually makes the rounds compound
  • The six modules that encode a working mathematician's workflow, including a Proof Commandment checklist and a pre-committed literature search designed to prevent contamination
  • Ablation results showing that stripping any major component (memory, verifiers, or modules) collapses performance, and that additional refinement rounds eventually make proofs worse
  • Why the comparison to GPT-5.2R and Aletheia isn't apples-to-apples, and what the honest version of the claim actually is
  • The Spielman ε-light subset problem as a concrete case: GPT-5.2R hallucinates a citation and lands a weaker bound; RMA produces a clean proof with a tighter bound using a different known technique
Chapters
  • 00:00 - The headline result on the First Proof benchmark
    RMA solves 8 of 10 expert-contributed problems while frontier systems and the base model alone score far lower.
  • 02:47 - The seven-agent setup and the shared whiteboard
    How initializer, proposer, and verifier agents iterate across five rounds through an append-only structured memory.
  • 05:35 - The six modules that encode mathematical workflow
    Problem analysis, knowledge bank, proof commandments, and the literature modules that turn a generic LLM into a math-research collaborator.
  • 08:23 - Methodological discipline against contamination
    Pre-committed search lists, sandboxed tools, and a training cutoff that predates the benchmark release.
  • 11:10 - The ablation table and the architecture-versus-scale claim
    Stripping modules, memory, or verifiers collapses win-rates, and a same-compute best-of-N baseline gets roughly half of RMA's performance.
  • 13:58 - Where the claims shouldn't be pushed too far
    Ten problems is a small sample, the industrial-system comparisons aren't controlled, and informal proofs resist bright-line evaluation.
  • 16:46 - The Spielman problem as a concrete illustration
    Three systems, three outcomes, and what the leverage-score proof reveals about applying known tools versus discovering new ones.
  • 19:34 - What this means for AI progress beyond math
    Why long-horizon reasoning tasks may benefit more from orchestration than from larger models, with appropriate caveats.
Recommended Reading
  • AlphaProof and AlphaGeometry: AI achieves silver-medal standard solving International Mathematical Olympiad problems - DeepMind's prior work on AI math reasoning, useful context for how industrial systems like Aletheia approach competition-level proofs versus the agentic orchestration approach in this episode.
  • Twice-Ramanujan Sparsifiers (Batson, Spielman, Srivastava) - The original barrier-method paper that GPT-5.2R reached for on the Spielman benchmark problem; worth reading to see the technique RMA chose not to use.
  • Self-Refine: Iterative Refinement with Self-Feedback - A foundational paper on the proposer-verifier refinement loop that RMA's multi-round architecture extends and stress-tests at research-math scale.
  • FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI - The expert-contributed math benchmark in the same spirit as First Proof, useful for situating how the field is currently measuring research-level mathematical capability.
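
For listeners who want a concrete picture of the seven-agent loop and the append-only whiteboard described in the takeaways, here is a minimal Python sketch. It is not RMA's code. The class names (`Entry`, `Whiteboard`), the agent signature, the rule that all agents in a phase read the same snapshot, and the `stub` agents that stand in for model calls are all assumptions made for illustration. Only the agent counts (one initializer, three proposers, three verifiers), the five rounds, and the append-only property come from the episode.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Entry:
    """One immutable record on the shared whiteboard (hypothetical schema)."""
    round_no: int
    author: str
    kind: str  # "analysis", "proposal", or "review"
    text: str


class Whiteboard:
    """Append-only memory: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def read(self) -> Tuple[Entry, ...]:
        # A tuple of frozen entries, so readers cannot rewrite history.
        return tuple(self._entries)


# Hypothetical agent signature: (problem, history so far) -> text.
Agent = Callable[[str, Tuple[Entry, ...]], str]


def run_rounds(problem: str, initializer: Agent, proposers: List[Agent],
               verifiers: List[Agent], rounds: int = 5) -> Whiteboard:
    board = Whiteboard()
    board.append(Entry(0, "initializer", "analysis",
                       initializer(problem, board.read())))
    for r in range(1, rounds + 1):
        # Assumption: proposers in a round all read the same snapshot.
        snapshot = board.read()
        for i, propose in enumerate(proposers):
            board.append(Entry(r, f"proposer-{i}", "proposal",
                               propose(problem, snapshot)))
        # Verifiers then see this round's proposals plus all earlier history.
        snapshot = board.read()
        for i, verify in enumerate(verifiers):
            board.append(Entry(r, f"verifier-{i}", "review",
                               verify(problem, snapshot)))
    return board


def stub(name: str) -> Agent:
    """Stand-in for a model call; it only reports how much history it saw."""
    return lambda problem, history: f"{name} saw {len(history)} entries"


board = run_rounds("toy problem", stub("init"), [stub("p")] * 3, [stub("v")] * 3)
print(len(board.read()))  # 1 initializer entry + 5 rounds * 6 agents = 31
```

Because nothing is ever overwritten, each round's agents see every earlier proposal and review, which is one plausible reading of why the episode says the shared memory makes the rounds compound.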

AI Papers: A Deep Dive, by paperdive.ai