LessWrong (30+ Karma)

“FrontierMath Score of o3-mini Much Lower Than Claimed” by YafahEdelman



OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] found a score of only 11%.

There are a few reasons to trust Epoch's score over OpenAI's:

  • Epoch built the benchmark and has better incentives.
  • OpenAI reported a 28% score on the hardest of the three problem tiers, suspiciously close to its 32% overall score.
  • Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.
  1. ^ Which had Python access.

The original text contained 1 footnote which was omitted from this narration.

---

First published:

March 17th, 2025

Source:

https://www.lesswrong.com/posts/z8zPL2hBqTmx7Kf6J/frontiermath-score-of-o3-mini-much-lower-than-claimed

---

Narrated by TYPE III AUDIO.

