February 06, 2026

“AI benchmarking has a Y-axis problem” by Lizka

14 minutes

TLDR: People plot benchmark scores over time and then do math on them, looking for speed-ups & inflection points, interpreting slopes, or extending apparent trends. But that math doesn’t actually tell you anything real unless the scores have natural units. Most don’t.

Think of benchmark scores as funhouse-mirror projections of “true” capability-space, which stretch some regions and compress others: each benchmark assigns its own warped score for how much accomplishing a given task counts in units of “AI progress”. A plot on axes without canonical units will look very different depending on how much weight we assign to different bits of progress.[1]
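A minimal sketch of the funhouse-mirror point, using made-up numbers rather than anything from the post: the latent capability values, the two task lists, and every function name below are hypothetical. The same steady capability growth is scored with a "fraction completed" rule on two benchmarks that differ only in where their task difficulties are bunched.

```python
def fraction_completed(capability, difficulties):
    """Share of tasks whose difficulty is at or below the model's capability."""
    solved = sum(1 for d in difficulties if d <= capability)
    return solved / len(difficulties)


# Hypothetical latent capability: grows by exactly one unit per period.
capability_over_time = [0, 1, 2, 3, 4]

# Two hypothetical benchmarks covering the same difficulty range (1 to 4).
# They differ only in how many tasks sit at each difficulty level.
easy_heavy = [1, 1, 1, 1, 2, 2, 3, 4]  # most tasks are easy
hard_heavy = [1, 2, 3, 3, 4, 4, 4, 4]  # most tasks are hard


def score_series(difficulties):
    """Benchmark score at each time step for the shared capability path."""
    return [fraction_completed(c, difficulties) for c in capability_over_time]


def gains(scores):
    """Period-over-period change in score (the 'slope' people read off plots)."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


easy_scores = score_series(easy_heavy)
hard_scores = score_series(hard_heavy)

print("easy-heavy scores:", easy_scores, "gains:", gains(easy_scores))
print("hard-heavy scores:", hard_scores, "gains:", gains(hard_scores))
```

In this toy setup the easy-heavy benchmark shows shrinking gains (apparent slowdown) while the hard-heavy one shows growing gains (apparent speed-up), even though the underlying capability rises by the same amount every period. Both benchmarks still rank any two time steps the same way; only the shape of the curve changes.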

Epistemic status: I haven’t vetted this post carefully, and have no real background in benchmarking or statistics.

Benchmark scores vs "units of AI progress"

Benchmarks look like rulers; they give us scores that we want to treat as (noisy) measurements of AI progress. But since most benchmark scores are expressed in quite squishy units, treating them that way can be misleading.

The typical benchmark is a grab-bag of tasks along with an aggregate scoring rule like “fraction completed”.[2]
✅ Scores like this can help us...
- Loosely rank models (“is A>B on coding ability?”)
- Operationalize & track milestones (“can [...]

---

Outline:

(01:00) Benchmark scores vs units of AI progress

(02:42) Exceptions: benchmarks with more natural units

(04:48) Does aggregation help?

(06:27) Where does this leave us?

(06:30) Non-benchmark methods often seem better

(07:32) Mind the Y-axis problem

(09:05) Bonus notes / informal appendices

(09:13) I. A more detailed example of the Y-axis problem in action

(11:53) II. An abstract sketch of what's going on (benchmarks as warped projections)

The original text contained 18 footnotes which were omitted from this narration.

---

First published:

February 6th, 2026

Source:

https://www.lesswrong.com/posts/EWfGf8qA7ZZifEAxG/ai-benchmarking-has-a-y-axis-problem-1

---

Narrated by TYPE III AUDIO.


By LessWrong

