
(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
When OpenAI first announced that o3 achieved 25% on FrontierMath, I was really freaked out. The next day, I asked Elliot Glazer, Epoch AI's lead mathematician and the main developer of FrontierMath, what he thought. He said he was also shocked and, based on that 25% result, expected o3 to "crush the IMO" and get an easy gold.
In retrospect, it really looks like we over-updated. While the public can't try o3 yet, we now have access to o3-mini (high), which achieves 20% on FrontierMath given 8 tries and 32% using a Python tool. This seems pretty close to o3's result, as we don't [...]
---
Outline:
(07:40) What is the purpose of benchmarks?
(12:12) How can a benchmark be more informative?
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.