May 26, 2026

Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick

31 minutes

Source: QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

Paper was published on May 22, 2026

This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An academic lab just matched OpenAI's Deep Research on several benchmarks using only 8,000 training examples — a recipe small enough to fit on one figure. The key move is a single data structure, the rubric tree, that unifies fact-seeking and report-writing into one training signal. We dig into how it works, what it actually unlocks, and the proprietary teacher stack quietly sitting underneath the word "fully synthetic."

Key Takeaways

Why the rubric tree works as one primitive for both obscure-fact tasks and open-ended report writing — and how it doubles as a synthesis target, a supervision filter, and an RL reward

The context condenser as an epistemic state machine: trusted facts, contradicted claims, and open leads paired with the next action

Why supervised fine-tuning actually hurts open-ended report quality, and how RL with GRPO recovers it

The capped reward design that prevents the agent from gaming citation credit against task completion

Why "fully synthetic" really means "no human annotation" — and why the open-weight model is effectively a distillation of an ensemble of closed frontier models

A 2B-parameter model that beats o3 on GAIA and HLE fact-seeking, with the asterisk that it collapses on report writing
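
For readers who want a concrete picture of the rubric tree from the first takeaway, here is a minimal sketch. It assumes a rubric is a weighted tree whose leaves are pass/fail checks and whose internal nodes take the weighted average of their children. The class name `RubricNode`, the `weight` and `passed` fields, and the averaging rule are all illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One criterion in a hypothetical rubric tree."""
    name: str
    weight: float = 1.0
    children: List["RubricNode"] = field(default_factory=list)
    passed: Optional[bool] = None  # only meaningful on leaves


def score(node: RubricNode) -> float:
    """Return a score in [0, 1]: leaves are pass/fail, parents average their children by weight."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total_weight


# A toy tree mixing objective fact checks with an open-ended report criterion.
tree = RubricNode("task", children=[
    RubricNode("facts", weight=2.0, children=[
        RubricNode("names the correct year", passed=True),
        RubricNode("names the correct author", passed=False),
    ]),
    RubricNode("report", weight=1.0, children=[
        RubricNode("compares at least two sources", passed=True),
    ]),
])

root_score = score(tree)
```

Because a single scalar comes out at the root, the same structure could plausibly serve as a generation target, a filter threshold for supervised data, and an RL reward, which is the unification the episode describes.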

00:00 — What a deep research agent actually is
How autonomous research loops differ from RAG, and why the field has been fragmented across three capabilities — fact seeking, citation grounding, and report synthesis.

03:35 — The rubric tree as a unifying primitive
How a hierarchical, auto-checkable rubric replaces the "single verifiable answer" format and lets one pipeline cover both objective and open-ended tasks.

09:31 — The context condenser
Compressing hundreds of tool calls into a structured state of trusted, contradicted, and uncertain claims — and why training-time and inference-time artifacts have to match.

11:40 — Mid-training, SFT, and RL
Walking through the three-stage pipeline and the capped reward formula that ties task completion to citation faithfulness.

15:34 — Ablations and the alignment tax
Why SFT alone regresses report quality, why RL recovers it, and the unresolved tradeoff between open-ended performance and hard reasoning benchmarks.

19:27 — What didn't work
The unsuccessful attempts section — pointwise judge bias, DPO collapse on long reports, and why relative scoring against a reference was the move that survived.

23:21 — The teacher dependency steelman
What "fully synthetic" actually means when the rubric generator, judge, fact-checker, and SFT teacher are all proprietary frontier models.

27:15 — What the paper unlocks
Why the rubric tree is the durable contribution, what a laptop-deployable research agent enables, and how to read future papers in this line.
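
To make the capped reward idea from the "Mid-training, SFT, and RL" chapter concrete, here is a rough sketch of one way such a cap could work. These notes do not give the paper's formula, so `capped_reward` and its `citation_weight` parameter are hypothetical: citation credit is simply clipped so it can never exceed the task-completion score. The `group_advantages` helper shows the standard group-relative normalization used in GRPO, where each sampled rollout's reward is centered and scaled against the other rollouts for the same task.

```python
from statistics import fmean, pstdev
from typing import List


def capped_reward(completion: float, citation: float, citation_weight: float = 0.5) -> float:
    """Hypothetical capped reward: citation credit is limited by task completion.

    Both inputs are assumed to lie in [0, 1]. An agent that cites perfectly
    but barely completes the task earns no more citation credit than its
    completion score allows.
    """
    return completion + citation_weight * min(citation, completion)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one task: (completion, citation faithfulness).
rollouts = [(0.2, 1.0), (0.2, 0.2), (0.9, 0.6), (0.6, 0.9)]
rewards = [capped_reward(c, f) for c, f in rollouts]
advantages = group_advantages(rewards)
```

In this sketch the first two rollouts receive the same reward even though one cites far more faithfully, because the cap ties citation credit to how much of the task was actually completed.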