
o3 is one of the biggest developments in AI in 2+ years, not because it beats a particular benchmark, but because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.
FrontierMath: https://epoch.ai/frontiermath
https://arxiv.org/pdf/2411.04872
Chollet Statement: https://arcprize.org/blog/oai-o3-pub-breakthrough
MLC Paper: https://www.scientificamerican.com/article/new-training-method-helps-ai-generalize-like-people-do/?utm_campaign=socialflow&utm_source=twitter&utm_medium=social
AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
Human Performance on ARC-AGI: https://arxiv.org/pdf/2409.01374v1
Wei Tweet ‘3 months’: https://x.com/_jasonwei/status/1870184982007644614
Deliberative Alignment Paper: https://openai.com/index/deliberative-alignment/
Brown Safety Tweet: https://x.com/polynoamial/status/1870196476908834893
Swe-Bench Verified: https://openai.com/index/introducing-swe-bench-verified/
Amodei Prediction: https://x.com/OfirPress/status/1858567863788769518
David Dohan: 16 hours https://x.com/dmdohan/status/1870171404093796638
OpenAI Personal Writing: https://openai.com/index/learning-to-reason-with-llms/
SimpleBench: https://simple-bench.com/
John Hallman Tweet: https://x.com/johnohallman/status/1870233375681945725
00:00 - Introduction
01:19 - What is o3?
03:18 - FrontierMath
05:15 - o4, o5
06:03 - GPQA
06:24 - Coding, Codeforces + SWE-verified, AlphaCode 2
08:13 - 1st Caveat
09:03 - Compositionality?
10:16 - SimpleBench?
13:11 - ARC-AGI, Chollet
By Philip - Host of AI Explained YT
