


Recently, OpenAI announced their newest model, o3, achieving massive improvements over the state of the art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark by Epoch AI of ridiculously hard, unseen math problems, of which previous models could solve only 2%. The events after the announcement, however, revealed that beyond OpenAI effectively having the answer sheet before taking the exam, the whole affair was shady and lacked transparency in every possible way, and it has much broader implications for AI benchmarking, evaluations, and safety.
These are the important events, in chronological order:
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
