Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math
Source: RMA: an Agentic System for Research-Level Mathematical Problems
Paper was published on May 20, 2026
This episode was AI-generated on May 25, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.
A university research group just outperformed OpenAI and DeepMind's flagship math systems on a benchmark of problems contributed by working mathematicians — using the same base model that scores zero on its own. The trick wasn't a bigger model. It was decomposing the work of a mathematician into specialized agents sharing a structured whiteboard, and the implications for AI progress reach well beyond math.
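To make the "structured whiteboard" idea concrete, here is a minimal Python sketch of one initializer, three proposers, and three verifiers writing to an append-only log over five rounds, as described in the episode. It is an illustration only, not the paper's implementation: the names `Entry`, `SharedMemory`, `run`, and `stub` are hypothetical, the agents are stubs standing in for model calls, and the order in which agents read and write within a round is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Entry:
    """One immutable record on the whiteboard (hypothetical schema)."""
    round_no: int
    agent: str
    kind: str  # "plan", "proposal", or "review"
    text: str


class SharedMemory:
    """Append-only log: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def read(self) -> Tuple[Entry, ...]:
        # Return an immutable snapshot so callers cannot rewrite history.
        return tuple(self._entries)


# An agent takes the problem and the history so far, and returns text.
Agent = Callable[[str, Tuple[Entry, ...]], str]


def run(problem: str, initializer: Agent, proposers: List[Agent],
        verifiers: List[Agent], rounds: int = 5) -> SharedMemory:
    memory = SharedMemory()
    memory.append(Entry(0, "initializer", "plan",
                        initializer(problem, memory.read())))
    for r in range(1, rounds + 1):
        # Assumption: agents act one after another and each sees everything
        # written so far, including same-round entries.
        for i, propose in enumerate(proposers, start=1):
            memory.append(Entry(r, f"proposer-{i}", "proposal",
                                propose(problem, memory.read())))
        for i, verify in enumerate(verifiers, start=1):
            memory.append(Entry(r, f"verifier-{i}", "review",
                                verify(problem, memory.read())))
    return memory


def stub(name: str) -> Agent:
    # Placeholder for a model call; it only reports how much history it saw.
    return lambda problem, history: f"{name} saw {len(history)} prior entries"


memory = run("toy problem", stub("init"), [stub("p")] * 3, [stub("v")] * 3)
print(len(memory.read()))  # 1 plan + 5 rounds * (3 proposals + 3 reviews) = 31
```

Because nothing is ever overwritten, each later agent sees a strictly longer history than the one before it, which is one plausible reading of why the rounds "compound" rather than restart.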
Key Takeaways
How RMA, built on Claude Opus 4.6, solves 8 of 10 First Proof problems while the same model with no scaffolding solves 0
The seven-agent setup — initializer, three proposers, three verifiers — and why an append-only shared memory is what actually makes the rounds compound
The six modules that encode a working mathematician's workflow, including a Proof Commandment checklist and a pre-committed literature search designed to prevent contamination
Ablation results showing that stripping any major component — memory, verifiers, modules — collapses performance, and that more refinement rounds eventually make proofs worse
Why the comparison to GPT-5.2R and Aletheia isn't apples-to-apples, and what the honest version of the claim actually is
The Spielman ε-light subset problem as a concrete case: GPT-5.2R hallucinates a citation and lands a weaker bound; RMA produces a clean proof with a tighter bound using a different known technique
00:00 — The headline result on the First Proof benchmark
RMA solves 8 of 10 expert-contributed problems while frontier systems and the base model alone score far lower.
02:47 — The seven-agent setup and the shared whiteboard
How initializer, proposer, and verifier agents iterate across five rounds through an append-only structured memory.
05:35 — The six modules that encode mathematical workflow
Problem analysis, knowledge bank, proof commandments, and the literature modules that turn a generic LLM into a math-research collaborator.
08:23 — Methodological discipline against contamination
Pre-committed search lists, sandboxed tools, and a training cutoff that predates the benchmark release.
11:10 — The ablation table and the architecture-versus-scale claim
Stripping modules, memory, or verifiers collapses win rates, and a same-compute best-of-N baseline reaches roughly half of RMA's performance.
13:58 — Where the claims shouldn't be pushed too far
Ten problems is a small sample, the industrial-system comparisons aren't controlled, and informal proofs resist bright-line evaluation.
16:46 — The Spielman problem as a concrete illustration
Three systems, three outcomes, and what the leverage-score proof reveals about applying known tools versus discovering new ones.
19:34 — What this means for AI progress beyond math
Why long-horizon reasoning tasks may benefit more from orchestration than from larger models, with appropriate caveats.
Recommended Reading
AlphaProof and AlphaGeometry: AI achieves silver-medal standard solving International Mathematical Olympiad problems — DeepMind's prior work on AI math reasoning, useful context for how industrial systems like Aletheia approach competition-level proofs versus the agentic orchestration approach in this episode.
Twice-Ramanujan Sparsifiers (Batson, Spielman, Srivastava) — The original barrier-method paper that GPT-5.2R reached for on the Spielman benchmark problem — worth reading to see the technique RMA chose not to use.
Self-Refine: Iterative Refinement with Self-Feedback — A foundational paper on the proposer-verifier refinement loop that RMA's multi-round architecture extends and stress-tests at research-math scale.
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI — The expert-contributed math benchmark in the same spirit as First Proof, useful for situating how the field is currently measuring research-level mathematical capability.