May 16, 2026

When Agent Benchmarks Lie: The Harness Problem in Open-Source AI

27 minutes

Source: Orchard: An Open-Source Agentic Modeling Framework

Paper was published on May 14, 2026

This episode was AI-generated on May 15, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A software-engineering agent scores 62% on its native test setup and 3.6% when you swap the wrapper around it. A new paper called Orchard argues this isn't a bug in one system but an indictment of how the entire open-source agent field has been measuring progress. It offers an infrastructure-first fix that costs roughly ten times less than managed sandbox services and generalizes across harnesses.

Key Takeaways

Why most reported agent benchmark scores measure harness-fit rather than underlying capability, and how a cross-harness test exposes the gap

How treating the sandbox layer as a thin, generic service (rather than baked-in plumbing) cuts training costs roughly 10x versus managed services like E2B and Daytona

Credit-assignment SFT: extracting partial supervision from failed teacher trajectories by finding the rising segment before the critical mistake

Balanced Adaptive Rollout (BAR): a self-pacing RL technique that stops generating rollouts once a prompt yields a useful mix of wins and losses

The surprising GUI result — a 4B-parameter student beating its 235B teacher — and why environment-grounded RL teaches something distillation can't

Honest limitations the paper undersells: the cross-harness comparison is partly confounded, RL gains are measured on a curated subset, and the whole recipe depends on a few open frontier teacher models staying open
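
To make the credit-assignment SFT takeaway above concrete, here is a minimal sketch of keeping only the productive prefix of a failed trajectory. It is an illustration, not the paper's implementation: the per-step `scores` (a teacher's hindsight estimate of progress after each step), the `drop` threshold, and the function name `productive_prefix` are all assumptions introduced for this example.

```python
from typing import List, Sequence


def productive_prefix(
    steps: Sequence[str],
    scores: Sequence[float],
    drop: float = 0.2,  # hypothetical threshold for a "critical mistake"
) -> List[str]:
    """Return the steps that come before the first sharp drop in progress.

    scores[i] is an assumed hindsight estimate of task progress after
    steps[i]. The first step whose score falls more than `drop` below the
    previous score is treated as the critical mistake and is excluded,
    along with everything after it.
    """
    if len(steps) != len(scores):
        raise ValueError("steps and scores must have the same length")
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > drop:
            return list(steps[:i])
    # No sharp drop found: the whole trajectory is returned unchanged.
    return list(steps)


trajectory = ["read issue", "locate bug", "write patch", "delete tests"]
progress = [0.1, 0.4, 0.6, 0.2]
kept = productive_prefix(trajectory, progress)
```

Here `kept` holds the first three steps, which could then be used as supervised targets while the step that derailed the attempt is discarded.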

00:00 — The 62-to-3.6 collapse
Why swapping the harness around the same model weights can demolish benchmark scores, and what that says about the field.

03:27 — Agent, harness, and what Orchard actually is
Vocabulary for ReAct loops and harnesses, plus the dual nature of Orchard as both infrastructure and training recipes.

06:54 — Sandboxes as a thin service
The architectural case for pulling the environment layer out behind a small REST API, and the cost and latency numbers that follow.

10:21 — Credit-assignment SFT: salvaging failed trajectories
Using hindsight from the teacher model to extract training signal from the productive prefix of attempts that ultimately failed.

13:49 — Balanced Adaptive Rollout for RL
A self-pacing rollout strategy that ensures every gradient batch contains both successes and failures, turning prompt difficulty into a runtime curriculum.

17:16 — The cross-harness experiment
Evaluating competing agents on harnesses they weren't trained against, and what the resulting collapses reveal about generalization.

20:43 — When a 4B student beats a 235B teacher
The browser-agent result, the tennis-coach analogy for why outcome-grounded RL can exceed imitation, and the captcha problem hiding in the training data.

24:11 — What the paper undersells
Honest critiques: confounded harness diversity, RL gains measured on a curated subset, and the ecosystem's fragile dependence on a few open teacher models.
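
The self-pacing idea from the Balanced Adaptive Rollout segment (13:49) can be sketched as a loop that keeps sampling rollouts for one prompt until the group contains both wins and losses, or a budget runs out. This is a simplified illustration under stated assumptions, not the paper's algorithm: `run_rollout`, `min_wins`, `min_losses`, and `max_rollouts` are hypothetical names and defaults chosen for the example.

```python
from typing import Callable, List


def balanced_adaptive_rollout(
    run_rollout: Callable[[], bool],  # hypothetical: runs one episode, True on success
    min_wins: int = 1,
    min_losses: int = 1,
    max_rollouts: int = 16,  # assumed per-prompt budget
) -> List[bool]:
    """Sample rollouts for one prompt until outcomes are mixed or the budget is spent."""
    outcomes: List[bool] = []
    while len(outcomes) < max_rollouts:
        outcomes.append(run_rollout())
        wins = sum(outcomes)
        losses = len(outcomes) - wins
        if wins >= min_wins and losses >= min_losses:
            break  # the group now carries a contrastive signal
    return outcomes


# Deterministic stand-in for an environment: two failures, then a success.
scripted = iter([False, False, True, True])
group = balanced_adaptive_rollout(lambda: next(scripted))
```

In this sketch an easy or impossible prompt uses its whole budget without ever producing a mixed group, while a prompt near the model's ability stops early, which is one way prompt difficulty could act as a runtime curriculum.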