
(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
When OpenAI first announced that o3 achieved 25% on FrontierMath, I was really freaked out. The next day, I asked Elliot Glazer, Epoch AI's lead mathematician and the main developer of FrontierMath, what he thought. He said he was also shocked, and expected o3 to "crush the IMO" and get an easy gold, based on the fact that it got 25% on FrontierMath.
In retrospect, it really looks like we over-updated. While the public can't try o3 yet, we now have access to o3-mini (high), which achieves 20% on FrontierMath given 8 tries, and 32% using a Python tool. This seems pretty close to o3's result, as we don't [...]
---
Outline:
(07:40) What is the purpose of benchmarks?
(12:12) How can a benchmark be more informative?
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.