May 28, 2026

When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks

27 minutes

Source: LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Paper was published on May 27, 2026

This episode was AI-generated on May 28, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Unplug a top AI search agent's internet connection and it still answers 44% of questions on a benchmark designed to require browsing. That uncomfortable result is the opening move in a paper that argues current search agents aren't really searching — they're verifying what they already know — and that the field's leaderboards have been measuring the wrong capability.

Key Takeaways

Why frontier search agents score nearly 39% on browsing benchmarks with no tools at all — and why this isn't data contamination

The evidence-blocking experiment: when given a search tool that can't find the answer, agents drop *below* their no-tools baseline, because hard negatives actively pull them off course

How trajectory analysis shows over half of agent queries are seeded by entities the model invented in its own reasoning, not extracted from retrieved documents

The construction logic behind LiveBrowseComp — recent plus obscure — and why a human-time control rules out 'it's just harder' as an explanation

Why the deployment risk is structural: agents are most reliable when you don't need them, and collapse silently when you do

The honest steelman: where the Intrinsic Knowledge Dependence (IKD) framing leans on the evidence-blocking result to do the load-bearing interpretive work

00:00 — The closed-book result
Pulling search tools off frontier agents reveals they already answer a large fraction of 'requires browsing' questions from memory alone.

03:01 — Why this isn't contamination
The distinction between leaked benchmark questions and broad world knowledge covering the answer territory — and why decontamination can't fix the latter.

06:03 — Evidence-blocking: the centerpiece experiment
Removing the supporting documents from the index while leaving hard negatives in place causes performance to collapse below the no-tools floor across every model tested.

09:05 — The open-book exam analogy
Why the failure pattern looks like a confident student rubber-stamping a textbook rather than reading it — and what that means for robustness.

12:07 — Trajectory analysis and Intrinsic Knowledge Dependence
Measuring where query entities come from and how often agents actually use retrieved evidence, leading to the paper's named failure mode: memory-backed verification rather than evidence-driven discovery.

15:09 — Building LiveBrowseComp
The recent-plus-obscure construction across six structured sources, designed to push answers outside any model's parametric memory.

18:10 — The human-time control and the reshuffled leaderboard
Why human solve rates and timing on both benchmarks are nearly identical, anchoring the claim that agent collapse on LiveBrowseComp reflects suppressed IKD rather than harder questions.

21:12 — Steelmanning the critique
Where the evidence-blocking setup is adversarial, where the IKD inference is underdetermined, and what survives the strongest version of the skeptic's case.

24:14 — The deployment inversion
Why these agents are most reliable in the regime where you don't need them and least reliable — silently — in the regime where search is the whole point.