March 13, 2025

“Don’t over-update on FrontierMath results” by David Matolcsi

16 minutes

(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)

When OpenAI first announced that o3 achieved 25% on FrontierMath, I was really freaked out. The next day, I asked Elliot Glazer, Epoch AI's lead mathematician and the main developer of FrontierMath, what he thought. He said he was also shocked, and expected o3 to "crush the IMO" and get an easy gold, based on the fact that it got 25% on FrontierMath.

In retrospect, it really looks like we over-updated. While the public can't try o3 yet, we now have access to o3-mini (high), which achieves 20% on FrontierMath given 8 tries, and gets 32% using a Python tool. This seems pretty close to o3's result, as we don't [...]

---

Outline:

(07:40) What is the purpose of benchmarks?

(12:12) How can a benchmark be more informative?

The original text contained 12 footnotes which were omitted from this narration.

---

First published:

March 11th, 2025

Source:

https://www.lesswrong.com/posts/9HfJbFy3ZZGzNsspw/don-t-over-update-on-frontiermath-results

---

Narrated by TYPE III AUDIO.