AI Papers: A Deep Dive

When Search Agents Don't Really Search: The Memory Shortcut Hiding in Browsing Benchmarks


Source: LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Paper was published on May 27, 2026

This episode was AI-generated on May 28, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Unplug a top AI search agent's internet connection and it still answers 44% of questions on a benchmark designed to require browsing. That uncomfortable result is the opening move in a paper that argues current search agents aren't really searching — they're verifying what they already know — and that the field's leaderboards have been measuring the wrong capability.

Key Takeaways
  • Why frontier search agents score nearly 39% on browsing benchmarks with no tools at all — and why this isn't data contamination
  • The evidence-blocking experiment: when given a search tool that can't find the answer, agents drop *below* their no-tools baseline, because hard negatives actively pull them off course
  • How trajectory analysis shows over half of agent queries are seeded by entities the model invented in its own reasoning, not extracted from retrieved documents
  • The construction logic behind LiveBrowseComp — recent plus obscure — and why a human-time control rules out 'it's just harder' as an explanation
  • Why the deployment risk is structural: agents are most reliable when you don't need them, and collapse silently when you do
  • The honest steelman: where the Intrinsic Knowledge Dependence (IKD) framing leans on the evidence-blocking result to do the load-bearing interpretive work

Chapters
  • 04:29 — The closed-book result
    Pulling search tools off frontier agents reveals they already answer a large fraction of 'requires browsing' questions from memory alone.
  • 03:01 — Why this isn't contamination
    The distinction between leaked benchmark questions and broad world knowledge covering the answer territory — and why decontamination can't fix the latter.
  • 06:03 — Evidence-blocking: the centerpiece experiment
    Removing the supporting documents from the index while leaving hard negatives in place causes performance to collapse below the no-tools floor across every model tested.
  • 09:05 — The open-book exam analogy
    Why the failure pattern looks like a confident student rubber-stamping a textbook rather than reading it — and what that means for robustness.
  • 12:07 — Trajectory analysis and Intrinsic Knowledge Dependence
    Measuring where query entities come from and how often agents actually use retrieved evidence, leading to the paper's named failure mode: memory-backed verification rather than evidence-driven discovery.
  • 15:09 — Building LiveBrowseComp
    The recent-plus-obscure construction across six structured sources, designed to push answers outside any model's parametric memory.
  • 18:10 — The human-time control and the reshuffled leaderboard
    Why human solve rates and timing on both benchmarks are nearly identical, anchoring the claim that agent collapse on LiveBrowseComp reflects suppressed IKD rather than harder questions.
  • 21:12 — Steelmanning the critique
    Where the evidence-blocking setup is adversarial, where the IKD inference is underdetermined, and what survives the strongest version of the skeptic's case.
  • 24:14 — The deployment inversion
    Why these agents are most reliable in the regime where you don't need them and least reliable — silently — in the regime where search is the whole point.

Recommended Reading
  • BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents — The original benchmark that this episode's paper diagnoses as partially measuring parametric memory rather than search ability.
  • BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent — The annotated retrieval-index version of BrowseComp that enables the evidence-blocking experiment central to the episode's IKD diagnosis.
  • BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese — The Chinese-language browsing benchmark whose tight ranking correlation with BrowseComp — versus the weak correlation with LiveBrowseComp — anchors the episode's claim that static benchmarks measure something different from live search.

AI Papers: A Deep Dive, by paperdive.ai