LessWrong (30+ Karma)

“AI benchmarking has a Y-axis problem” by Lizka



TLDR: People plot benchmark scores over time and then do math on them, looking for speed-ups & inflection points, interpreting slopes, or extending apparent trends. But that math doesn’t actually tell you anything real unless the scores have natural units. Most don’t.

Think of benchmark scores as funhouse-mirror projections of “true” capability-space, which stretch some regions and compress others by assigning warped scores for how much accomplishing each task counts in units of “AI progress”. A plot on axes without canonical units will look very different depending on how much weight we assign to different bits of progress.[1]
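To make the point concrete, here is a minimal Python sketch. The scores are made-up illustrative numbers, and the logit rescaling is just one arbitrary order-preserving transform chosen for this example; neither comes from the original post. The same sequence of results looks like steady progress on one axis and like a slowdown followed by a speed-up on the other, even though the ranking of models never changes.

```python
import math

def logit(p):
    """Map a fraction in (0, 1) to log-odds. Order-preserving, but not linear."""
    return math.log(p / (1.0 - p))

def increments(values):
    """Differences between consecutive values (the 'slope' per time step)."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical "fraction of tasks completed" for five successive models.
raw_scores = [0.1, 0.3, 0.5, 0.7, 0.9]

# The same results on a different, equally defensible Y-axis (assumption: logit).
rescaled_scores = [logit(p) for p in raw_scores]

raw_steps = increments(raw_scores)
rescaled_steps = increments(rescaled_scores)

print("raw steps:     ", [round(s, 3) for s in raw_steps])
print("rescaled steps:", [round(s, 3) for s in rescaled_steps])
```

On the raw axis every step is the same size, so the trend looks linear. On the logit axis the first and last steps are larger than the middle ones, so the "same" data appears to decelerate and then accelerate. Neither picture is privileged unless the units are.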

Epistemic status: I haven’t vetted this post carefully, and have no real background in benchmarking or statistics.

Benchmark scores vs "units of AI progress"

Benchmarks look like rulers; they give us scores that we want to treat as (noisy) measurements of AI progress. But since most benchmark scores are expressed in squishy units, treating them that way can be quite misleading.

  • The typical benchmark is a grab-bag of tasks along with an aggregate scoring rule like “fraction completed”[2]

  • ✅ Scores like this can help us...
    • Loosely rank models (“is A>B on coding ability?”)

    • Operationalize & track milestones (“can [...]

---

Outline:

(01:00) Benchmark scores vs units of AI progress

(02:42) Exceptions: benchmarks with more natural units

(04:48) Does aggregation help?

(06:27) Where does this leave us?

(06:30) Non-benchmark methods often seem better

(07:32) Mind the Y-axis problem

(09:05) Bonus notes / informal appendices

(09:13) I. A more detailed example of the Y-axis problem in action

(11:53) II. An abstract sketch of what's going on (benchmarks as warped projections)

The original text contained 18 footnotes which were omitted from this narration.

---

First published:

February 6th, 2026

Source:

https://www.lesswrong.com/posts/EWfGf8qA7ZZifEAxG/ai-benchmarking-has-a-y-axis-problem-1

---

Narrated by TYPE III AUDIO.

