May 19, 2026

Why Parallel Sampling Plateaus, And What Evidence Graphs Do Instead

22 minutes

Source: Argus: Evidence Assembly for Scalable Deep Research Agents

Paper was published on May 15, 2026

This episode was AI-generated on May 18, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Running 64 web-browsing agents in parallel and voting on their answers barely beats running one — because correlated agents make correlated mistakes. A new paper from MiroMind AI proposes an architectural fix: a swarm of Searchers that assemble evidence into a shared graph, and a single Navigator that reads only the graph, achieving a 1200-to-1 compression ratio that finally lets parallel scaling keep paying off.
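The correlated-mistakes claim can be illustrated with a toy calculation. This sketch is not from the paper: it assumes binary answers, a per-agent accuracy `p`, and a hypothetical `shared` parameter giving the fraction of questions on which every agent lands on the same answer. The function names are illustrative only.

```python
from math import comb


def independent_majority_accuracy(n: int, p: float) -> float:
    """Chance that a majority of n independent agents is right (ties count as half)."""
    total = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:
            total += prob
        elif 2 * k == n:
            total += 0.5 * prob
    return total


def correlated_majority_accuracy(n: int, p: float, shared: float) -> float:
    """Toy mixture model (an assumption, not the paper's analysis).

    With probability `shared`, all agents give one common answer, which is
    right with probability p. Otherwise the agents answer independently.
    """
    return shared * p + (1 - shared) * independent_majority_accuracy(n, p)


if __name__ == "__main__":
    for shared in (0.0, 0.5, 0.8, 1.0):
        acc = correlated_majority_accuracy(64, 0.6, shared)
        print(f"shared={shared:.1f}  vote accuracy={acc:.3f}")
```

In this toy model, fully correlated agents (`shared=1.0`) vote exactly as well as one agent, however many are run, which is the plateau the episode opens with.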

Key Takeaways

Why majority-vote scaling flattens fast: parallel agents sample from the same distribution and reinforce the same wrong answers

How the Searcher/Navigator split turns parallelism from 'voting over guesses' into 'assembling pieces of a jigsaw'

What the evidence DAG actually buys you — typed support/refute edges make contradictions visible as graph structure, not text to re-read

The contrastive 'shadow pass' reward that fixes the proofreader problem in agent RL by measuring whether verification work actually changed the answer

A walkthrough of the Jesse Duroha → Nicholas Constant case study, where structural self-doubt catches an error that voting never would

Honest limitations: MiroMind benchmarking against its own backbone, the Navigator as a serialization bottleneck, and the schema-coordination problem if this approach were to spread
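To make the typed-edge takeaway concrete, here is a minimal sketch of an evidence graph in which contradictions and gaps come from simple structural queries. It is an illustration under assumptions, not the Argus schema: the class name, the method names, the two relation labels, and the example IDs are all hypothetical.

```python
from dataclasses import dataclass, field

RELATIONS = ("support", "refute")  # assumed edge types, per the episode summary


@dataclass
class EvidenceGraph:
    """Hypothetical bipartite graph: evidence nodes point at claim nodes."""

    claims: set = field(default_factory=set)
    # claim_id -> list of (evidence_id, relation) edges
    edges: dict = field(default_factory=dict)

    def add_claim(self, claim_id: str) -> None:
        self.claims.add(claim_id)

    def add_evidence(self, evidence_id: str, claim_id: str, relation: str) -> None:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.claims.add(claim_id)
        self.edges.setdefault(claim_id, []).append((evidence_id, relation))

    def contested(self) -> list:
        """Claims with both supporting and refuting evidence."""
        return sorted(
            c for c in self.claims
            if {rel for _, rel in self.edges.get(c, [])} == set(RELATIONS)
        )

    def untouched(self) -> list:
        """Claims that no Searcher has attached any evidence to yet."""
        return sorted(c for c in self.claims if not self.edges.get(c))


graph = EvidenceGraph()
graph.add_claim("constraint_a")
graph.add_claim("constraint_b")
graph.add_evidence("page_1", "constraint_a", "support")
graph.add_evidence("page_2", "constraint_a", "refute")

print(graph.contested())  # claims where sources disagree
print(graph.untouched())  # claims still lacking evidence
```

Because edges only run from evidence to claims, this toy graph is acyclic by construction. A reader such as the Navigator could call `contested()` and `untouched()` instead of re-reading transcripts.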

00:00 — Why majority voting on agents plateaus
The opening framing: 64 correlated agents asked the same question think the same wrong thoughts, and voting can't fix correlated mistakes.

02:27 — The Searcher/Navigator split and the 1200-to-1 compression
How separating field reporters from the editor — and never letting the editor read raw transcripts — decouples accuracy from context length.

04:55 — The evidence DAG as a detective's corkboard
Why typed support and refute edges let the Navigator see contradictions and gaps as structure rather than re-deriving them from text.

07:23 — Three operating modes and the scaling curve
Solo Searcher, Navigator plus one, and Navigator plus many, and how accuracy on BrowseComp climbs log-linearly (roughly linear in the log of the Searcher count) as Searchers are added.

09:51 — The proofreader problem and the contrastive reward
Why outcome rewards teach the Navigator to take credit it didn't earn, and how a shadow-pass counterfactual produces a sharper training signal.

12:19 — Case study: catching a confident wrong answer
A worked example where the Navigator notices untouched constraints, dispatches an alternative-hypothesis probe, and flips a wrong answer to a right one.

14:47 — Zero-shot transfer across Searcher backbones
Evidence that the Navigator learned a general 'read the graph, find the gaps' skill rather than memorizing one Searcher's quirks.

17:15 — Limitations and what to be skeptical of
Self-benchmarking, schema coordination across labs, training-vs-deployment counterfactual gaps, and the Navigator as a future serialization bottleneck.

19:43 — The bigger reframe: structure outside the model
Why Argus is a bet against pure context-length scaling and in favor of external scaffolding that persists and gets read in compressed form.
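The contrastive reward from the 09:51 chapter can be sketched in a few lines. These notes do not give the paper's exact formula, so this is one plausible reading under stated assumptions: the "shadow pass" is taken to be the answer the system would have given without the Navigator's verification work, and the reward is the difference in correctness. The function names and the exact-match scoring are hypothetical.

```python
def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Plain outcome reward: 1.0 if the final answer is right, else 0.0."""
    return float(final_answer == gold_answer)


def contrastive_reward(final_answer: str, shadow_answer: str, gold_answer: str) -> float:
    """Assumed contrastive form: credit only the change caused by verification.

    +1.0  verification turned a wrong answer into a right one
     0.0  verification did not change correctness
    -1.0  verification turned a right answer into a wrong one
    """
    final_correct = float(final_answer == gold_answer)
    shadow_correct = float(shadow_answer == gold_answer)
    return final_correct - shadow_correct


# Proofreader problem: the answer was already right before verification.
print(outcome_reward("A", "A"))           # outcome reward still pays out
print(contrastive_reward("A", "A", "A"))  # contrastive reward gives no credit

# Verification actually flipped a wrong answer to the right one.
print(contrastive_reward("A", "B", "A"))
```

Under this reading, the outcome reward cannot tell a Navigator that fixed an error from one that rubber-stamped a correct answer, while the contrastive version separates the two cases.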