Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents
Source: OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Paper was published on May 05, 2026
This episode was AI-generated on May 6, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
A university team fine-tuned an open-weights model on roughly ten thousand examples and beat Alibaba's industrially trained search agent on every benchmark, using only one of the standard pipeline's three stages: fine-tuning. The result is an argument about what reinforcement learning was actually doing for these systems, and whether the field has been spending compute to fix a data problem.
Key Takeaways
Why a 16.5-point benchmark jump from v1 to v2 came entirely from changing the training data, not the model or method
The three data changes (bigger knowledge graph chunks, expanded toolkits, and a hard minimum-tool-call filter) and the single idea behind them
Why imitation learning may inherit the 'patience' of its demonstrations, making RL-style long-horizon polish less necessary than assumed
Where the paper's framing oversells: the base model is itself the product of a full industrial pre-training run
What the paper conspicuously doesn't do: no ablations isolating the three data changes, no variance across seeds, no validation that trajectory length tracks difficulty
Why the result reshapes a research program rather than just topping a leaderboard, if it generalizes beyond search agents
00:00 - What a search agent actually does
Setting up the ReAct loop and the texture of training examples that average 65 tool calls each.
01:45 - The three-stage pipeline and its implicit assumption
Why the field assumed pre-training, fine-tuning, and reinforcement learning each install something the others can't.
03:31 - The v1-to-v2 jump: same model, same method, different data
The cleanest piece of internal evidence: a 16.5-point BrowseComp gain from data changes alone.
05:16 - The three data changes and the marathon-runner intuition
Bigger graph chunks, more diverse tools, and a hard filter that throws out any trajectory the agent solved too quickly.
07:02 - The benchmark results against Tongyi and the giants
Beating Alibaba's same-size agent on every benchmark, and a 30B model outscoring 671B DeepSeek-V3.1 on BrowseComp.
08:47 - Where the paper's framing oversells
The base model still came from a full industrial pre-training run, so the claim is narrower than the abstract suggests.
10:33 - Missing ablations, missing variance, and the length-as-difficulty proxy
The methodological soft spots: no isolation of which data change matters, no seed variance, and an unvalidated proxy for difficulty.
12:18 - What this means for resource allocation in the field
If RL was largely compensating for weak fine-tuning data, the implication reshapes how labs should spend compute, assuming it generalizes.
Recommended Reading
ReAct: Synergizing Reasoning and Acting in Language Models - The original ReAct paper that introduced the reason-act-observe loop the episode uses to define what a search agent actually is.
LIMA: Less Is More for Alignment - A precursor in spirit to this episode's argument, showing that a small number of carefully curated fine-tuning examples can match much heavier post-training pipelines.
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents - The headline benchmark behind the episode's v1-to-v2 jump and the comparisons against Tongyi DeepResearch and much larger frontier models.
Humanity's Last Exam - The brutal multi-domain expert benchmark cited in the episode's results table, useful for understanding what 'hard question' means at the frontier.