


(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
When OpenAI first announced that o3 achieved 25% on FrontierMath, I was really freaked out. The next day, I asked Elliot Glazer, Epoch AI's lead mathematician and the main developer of FrontierMath, what he thought. He said he was also shocked, and that he expected o3 to "crush the IMO" and get an easy gold, given its 25% score on FrontierMath.
In retrospect, it really looks like we over-updated. While the public can't try o3 yet, we now have access to o3-mini (high), which achieves 20% on FrontierMath given 8 tries, and 32% when using a Python tool. This seems pretty close to o3's result, as we don't [...]
---
Outline:
(07:40) What is the purpose of benchmarks?
(12:12) How can a benchmark be more informative?
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
