Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
Source: QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
Paper was published on May 22, 2026
This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
An academic lab just matched OpenAI's Deep Research on several benchmarks using only 8,000 training examples — a recipe small enough to fit on one figure. The key move is a single data structure, the rubric tree, that unifies fact-seeking and report-writing into one training signal. We dig into how it works, what it actually unlocks, and the proprietary teacher stack quietly sitting underneath the word "fully synthetic."
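For listeners who want a concrete picture of the rubric tree, here is a minimal Python sketch of the idea: a hierarchy whose leaves are automatically checkable criteria and whose internal nodes aggregate their children into one score. This is an illustration, not the paper's implementation. The `RubricNode` class, the weights, the weighted-average aggregation, and the string-matching checks are all assumptions made for the example. In the setup the episode describes, leaf checks are handled by model-based judges rather than simple predicates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (names and fields are illustrative)."""
    description: str
    weight: float = 1.0
    # Leaves carry a check; internal nodes carry children instead.
    check: Optional[Callable[[str], bool]] = None
    children: List["RubricNode"] = field(default_factory=list)

    def score(self, answer: str) -> float:
        """Return a score in [0, 1]: leaves are pass/fail, parents take a weighted mean."""
        if not self.children:
            if self.check is None:
                raise ValueError(f"leaf {self.description!r} has no check")
            return 1.0 if self.check(answer) else 0.0
        total = sum(child.weight for child in self.children)
        return sum(child.weight * child.score(answer) for child in self.children) / total


# One tree mixes an objective fact check with open-ended report criteria.
root = RubricNode(
    "overall task",
    children=[
        RubricNode("names the correct author", weight=2.0,
                   check=lambda a: "Ada Lovelace" in a),
        RubricNode("report quality", children=[
            RubricNode("has a summary", check=lambda a: a.startswith("Summary:")),
            RubricNode("cites a source", check=lambda a: "http" in a),
        ]),
    ],
)

answer = "Summary: Ada Lovelace wrote the notes on the Analytical Engine."
print(round(root.score(answer), 3))  # prints 0.833
```

The same scalar could in principle serve as a filter threshold for supervised examples or as a reward during RL, which is the sense in which the episode calls the rubric tree a single primitive.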
Key Takeaways
Why the rubric tree works as one primitive for both obscure-fact tasks and open-ended report writing, and how it doubles as a synthesis target, a supervision filter, and an RL reward
The context condenser as an epistemic state machine: trusted facts, contradicted claims, and open leads paired with the next action
Why supervised fine-tuning actually hurts open-ended report quality, and how RL with GRPO recovers it
The capped reward design that prevents the agent from gaming citation credit against task completion
Why "fully synthetic" really means "no human annotation", and why the open-weight model is effectively a distillation of an ensemble of closed frontier models
A 2B-parameter model that beats o3 on GAIA and HLE fact-seeking, with the asterisk that it collapses on report writing

00:00 - What a deep research agent actually is
How autonomous research loops differ from RAG, and why the field has been fragmented across three capabilities: fact seeking, citation grounding, and report synthesis.

30:35 - The rubric tree as a unifying primitive
How a hierarchical, auto-checkable rubric replaces the "single verifiable answer" format and lets one pipeline cover both objective and open-ended tasks.

09:31 - The context condenser
Compressing hundreds of tool calls into a structured state of trusted, contradicted, and uncertain claims, and why training-time and inference-time artifacts have to match.

11:40 - Mid-training, SFT, and RL
Walking through the three-stage pipeline and the capped reward formula that ties task completion to citation faithfulness.

15:34 - Ablations and the alignment tax
Why SFT alone regresses report quality, why RL recovers it, and the unresolved tradeoff between open-ended performance and hard reasoning benchmarks.

19:27 - What didn't work
The unsuccessful attempts section: pointwise judge bias, DPO collapse on long reports, and why relative scoring against a reference was the move that survived.

23:21 - The teacher dependency steelman
What "fully synthetic" actually means when the rubric generator, judge, fact-checker, and SFT teacher are all proprietary frontier models.

27:15 - What the paper unlocks
Why the rubric tree is the durable contribution, what a laptop-deployable research agent enables, and how to read future papers in this line.

Recommended Reading
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge - The Ohio State group's earlier hand-crafted precursor to QUEST, where the rubric-tree evaluation primitive was first developed before being automated in this episode's paper.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning - Popularized the GRPO algorithm (introduced earlier in the DeepSeekMath paper), whose relative-advantage scoring QUEST uses and which the episode connects to QUEST's relative-to-reference judging trick.
Tongyi DeepResearch Technical Report - Alibaba's open deep research agent, which serves as QUEST's SFT teacher and is the fact-seeking-strong, report-weak baseline the episode contrasts against.
GAIA: A Benchmark for General AI Assistants - The hard-reasoning agent benchmark the episode repeatedly cites when discussing QUEST's small-model results and the alignment-tax tradeoff from RL training.
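
As a companion to the DeepSeek-R1 entry above, here is a rough Python sketch of the group-relative advantage at the heart of GRPO: several rollouts are sampled for the same task, and each one's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for the illustration. These notes do not describe how QUEST computes or combines its rewards, so the example rewards are made-up numbers.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own group (hypothetical helper).

    rewards: one scalar reward per rollout sampled for the same task.
    eps: small constant (an assumption) to avoid dividing by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts for one task, scored by some reward function (values are illustrative).
rewards = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Rollouts that score above their group's average get a positive advantage and those below get a negative one, so no separate value model is needed to provide a baseline.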