
TLDR: People plot benchmark scores over time and then do math on them, looking for speed-ups & inflection points, interpreting slopes, or extending apparent trends. But that math doesn’t actually tell you anything real unless the scores have natural units. Most don’t.
Think of benchmark scores as funhouse-mirror projections of “true” capability-space, which stretch some regions and compress others by assigning warped scores for how much accomplishing that task counts in units of “AI progress”. A plot on axes without canonical units will look very different depending on how much weight we assign to different bits of progress.[1]
Epistemic status: I haven’t vetted this post carefully, and have no real background in benchmarking or statistics.
Benchmark scores vs "units of AI progress"
Benchmarks look like rulers; they give us scores that we want to treat as (noisy) measurements of AI progress. But since most benchmark scores are expressed in quite squishy units, that can be quite misleading.
The typical benchmark is a grab-bag of tasks along with an aggregate scoring rule like “fraction completed”.[2] These scores get used to:
- Loosely rank models (“is A>B on coding ability?”)
- Operationalize & track milestones (“can [...]
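As a toy illustration of the TLDR's point (my sketch, not from the post): the timeline, the "true capability" curve, and both scoring rules below are invented, but they show how two benchmarks that are monotone in the same underlying capability can tell opposite stories about whether progress is speeding up and about when the biggest jump happened.

```python
# Minimal sketch: two made-up benchmarks that are both monotone functions of the
# same hypothetical "true capability" curve, yet imply opposite trends.
import numpy as np

years = np.arange(2019, 2027)            # hypothetical timeline
capability = (years - 2019) / 7.0        # assume "true" capability grows linearly, 0..1

# Two invented scoring rules, both strictly increasing in capability:
score_a = 100 * capability ** 2          # convex warp: hard tasks dominate the score
score_b = 100 * np.sqrt(capability)      # concave warp: easy tasks dominate the score

for name, scores in [("A (convex)", score_a), ("B (concave)", score_b)]:
    yearly_gains = np.diff(scores)                      # year-over-year score changes
    trend = "speeding up" if np.diff(scores, 2).mean() > 0 else "slowing down"
    biggest_jump = years[np.argmax(yearly_gains) + 1]   # year with the largest gain
    print(f"Benchmark {name}: progress looks like it is {trend}; "
          f"largest jump lands in {biggest_jump}")
```

Under these assumptions the underlying progress is perfectly steady, yet one warp makes it look like acceleration toward 2026 and the other like an early jump followed by a plateau; that is the sense in which slopes and inflection points depend on the choice of units.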
---
Outline:
(01:00) Benchmark scores vs units of AI progress
(02:42) Exceptions: benchmarks with more natural units
(04:48) Does aggregation help?
(06:27) Where does this leave us?
(06:30) Non-benchmark methods often seem better
(07:32) Mind the Y-axis problem
(09:05) Bonus notes / informal appendices
(09:13) I. A more detailed example of the Y-axis problem in action
(11:53) II. An abstract sketch of what's going on (benchmarks as warped projections)
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.