March 18, 2025

“FrontierMath Score of o3-mini Much Lower Than Claimed” by YafahEdelman

1 minute

OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] yielded only 11%.

There are a few reasons to trust Epoch's score over OpenAI's:

Epoch built the benchmark and has better incentives.
OpenAI reported a 28% score on the hardest of the three problem tiers - suspiciously close to their overall score.
Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.

[1] Which had Python access.

The original text contained 1 footnote which was omitted from this narration.

---

First published:

March 17th, 2025

Source:

https://www.lesswrong.com/posts/z8zPL2hBqTmx7Kf6J/frontiermath-score-of-o3-mini-much-lower-than-claimed

---

Narrated by TYPE III AUDIO.

By LessWrong
