May 17, 2026

“Benchmarking Real Work” by kaivu, leni, rohuang, zef

7 minutes

Thanks to Megan Kinniment for helpful comments and discussion.

TL;DR: Benchmarks like HCAST undersample fuzzy (hard-to-evaluate) tasks, so they might overestimate capability on long-horizon work. To sample fuzzy tasks we need to increase judge capacity: we can either build automated judges that match human judgment, or reduce the human effort per grade. To do this, we propose generating fuzzy tasks as a byproduct of real SWE work: snapshot the repo and a proto-spec before starting, and after finishing, use an AI transform to produce an executable spec and LLM-judge conditions. Because the engineer just did the work, verifying the judges or grading the agent directly is much cheaper than grading the task from scratch. I think this would be a good way to collect tasks, as well as a useful personal epistemic tool.
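As a rough illustration of the snapshot-then-transform flow in the TL;DR, here is a minimal Python sketch. Everything in it is an assumption made for illustration rather than something the post specifies: the `FuzzyTask` schema, the `start_task` and `finish_task` helpers, and the `Transform` signature are all hypothetical, and `stub_transform` stands in for what would in practice be an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class FuzzyTask:
    """One task harvested from real engineering work (hypothetical schema)."""
    base_commit: str                       # repo snapshot taken before the work starts
    proto_spec: str                        # rough task description written before starting
    final_commit: Optional[str] = None     # repo state after the engineer finishes
    executable_spec: Optional[str] = None  # produced by the AI transform
    judge_conditions: List[str] = field(default_factory=list)  # LLM-judge conditions
    verified_by_engineer: bool = False     # set once the engineer has checked the judges


# Hypothetical transform signature:
# (proto_spec, base_commit, final_commit) -> (executable_spec, judge_conditions)
Transform = Callable[[str, str, str], Tuple[str, List[str]]]


def start_task(base_commit: str, proto_spec: str) -> FuzzyTask:
    """Record the 'before' state: repo snapshot plus proto-spec."""
    return FuzzyTask(base_commit=base_commit, proto_spec=proto_spec)


def finish_task(task: FuzzyTask, final_commit: str, transform: Transform) -> FuzzyTask:
    """After the work is done, run the transform to fill in the spec and judge conditions."""
    spec, conditions = transform(task.proto_spec, task.base_commit, final_commit)
    task.final_commit = final_commit
    task.executable_spec = spec
    task.judge_conditions = list(conditions)
    return task


def stub_transform(proto_spec: str, base_commit: str, final_commit: str) -> Tuple[str, List[str]]:
    """Placeholder for the AI transform; a real one would call a model."""
    spec = f"Starting from {base_commit}: {proto_spec}"
    return spec, ["Behaviour matches the change made in " + final_commit]


task = finish_task(
    start_task("abc123", "Make the importer resumable"),
    "def456",
    stub_transform,
)
```

The `verified_by_engineer` flag stays `False` until the engineer who did the work signs off on the judge conditions, which is the cheap verification step the proposal relies on.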

This is a two-part series on capability evaluation. Part 1 is about acquiring fuzzy tasks, and part 2 is about analyzing them.

Motivation: sampling bias in HCAST

There are several well-described limitations of time horizons. But the strongest reason I don’t update much on trends in time horizons (and time-horizon-like tasks) is that I think all existing evaluations [...]

---

Outline:

(01:14) Motivation: sampling bias in HCAST

(02:47) Making fuzzy-task sampling viable by increasing judge capacity

(04:02) Proposal: sampling from real work

(05:18) Advantages

(06:10) Discussion

(06:13) How inconvenient is this?

(06:32) Can we test fuzzy skills by just testing longer tasks?

The original text contained 3 footnotes which were omitted from this narration.

---

First published: