When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks
Source: LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
Paper was published on May 27, 2026
This episode was AI-generated on May 28, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.
Unplug a top AI search agent's internet connection and it still answers 44% of questions on a benchmark designed to require browsing. That uncomfortable result is the opening move in a paper that argues current search agents aren't really searching — they're verifying what they already know — and that the field's leaderboards have been measuring the wrong capability.
Key Takeaways
Why frontier search agents score nearly 39% on browsing benchmarks with no tools at all, and why this isn't data contamination
The evidence-blocking experiment: when given a search tool that can't find the answer, agents drop *below* their no-tools baseline, because hard negatives actively pull them off course
How trajectory analysis shows over half of agent queries are seeded by entities the model invented in its own reasoning, not extracted from retrieved documents
The construction logic behind LiveBrowseComp (recent plus obscure) and why a human-time control rules out 'it's just harder' as an explanation
Why the deployment risk is structural: agents are most reliable when you don't need them, and collapse silently when you do
The honest steelman: where the Intrinsic Knowledge Dependence (IKD) framing leans on the evidence-blocking result to do the load-bearing interpretive work

04:29 - The closed-book result
Pulling search tools off frontier agents reveals they already answer a large fraction of 'requires browsing' questions from memory alone.

03:01 - Why this isn't contamination
The distinction between leaked benchmark questions and broad world knowledge covering the answer territory, and why decontamination can't fix the latter.

06:03 - Evidence-blocking: the centerpiece experiment
Removing the supporting documents from the index while leaving hard negatives in place causes performance to collapse below the no-tools floor across every model tested.

09:05 - The open-book exam analogy
Why the failure pattern looks like a confident student rubber-stamping a textbook rather than reading it, and what that means for robustness.

12:07 - Trajectory analysis and Intrinsic Knowledge Dependence
Measuring where query entities come from and how often agents actually use retrieved evidence, leading to the paper's named failure mode: memory-backed verification rather than evidence-driven discovery.

15:09 - Building LiveBrowseComp
The recent-plus-obscure construction across six structured sources, designed to push answers outside any model's parametric memory.

18:10 - The human-time control and the reshuffled leaderboard
Why human solve rates and timing on both benchmarks are nearly identical, anchoring the claim that agent collapse on LiveBrowseComp reflects suppressed IKD rather than harder questions.

21:12 - Steelmanning the critique
Where the evidence-blocking setup is adversarial, where the IKD inference is underdetermined, and what survives the strongest version of the skeptic's case.

24:14 - The deployment inversion
Why these agents are most reliable in the regime where you don't need them and least reliable, silently, in the regime where search is the whole point.

Recommended Reading
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents - The original benchmark that this episode's paper diagnoses as partially measuring parametric memory rather than search ability.
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent - The annotated retrieval-index version of BrowseComp that enables the evidence-blocking experiment central to the episode's IKD diagnosis.
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese - The Chinese-language browsing benchmark whose tight ranking correlation with BrowseComp (versus the weak correlation with LiveBrowseComp) anchors the episode's claim that static benchmarks measure something different from live search.