LessWrong (30+ Karma)

“Benchmarking Real Work” by kaivu, leni, rohuang, zef


Listen Later

Thanks to Megan Kinniment for helpful comments and discussion.

TL;DR: Benchmarks like HCAST undersample fuzzy (hard to evaluate) tasks, meaning they might overestimate capability on long-horizon work. To sample fuzzy tasks we need to increase judge capacity: we can either try to build automated judges that match human judgment, or reduce the human effort per grade. To do this, we propose generating fuzzy tasks as a byproduct of real SWE work — snapshot the repo and a proto-spec before starting, and after finishing, use an AI transform to produce an executable spec and LLM-judge conditions. Because the engineer just did the work, verifying the judges or grading the agent directly is much cheaper than grading the task from scratch. I think this would be a good way to collect tasks, as well as a useful personal epistemic tool.

This is a two-part series on capability evaluation. Part 1 is about acquiring fuzzy tasks, and part 2 is about analyzing them.

Motivation: sampling bias in HCAST

There are several well-described limitations of time horizons. But the strongest reason that I don’t update that much on trends in time horizons (and time horizon-like tasks) is because I think all existing evaluations [...]

---

Outline:

(01:14) Motivation: sampling bias in HCAST

(02:47) Making fuzzy tasks sampling viable by increasing judge capacity

(04:02) Proposal: sampling from real work

(05:18) Advantages

(06:10) Discussion

(06:13) How inconvenient is this?

(06:32) Can we test fuzzy skills by just testing longer tasks?

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

May 16th, 2026

Source:

https://www.lesswrong.com/posts/NbDjD47u6WmthgiDC/benchmarking-real-work

---

Narrated by TYPE III AUDIO.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,330 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,247 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

563 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,328 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates! by Liron Shapira

Doom Debates!

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners