


Benchmarks have never been less useful for telling us which models are best.
They are good for giving a general sense of the landscape. They definitely paint a picture. But if you’re comparing top models, like GPT-5.4 against Opus 4.6 against Gemini 3.1 Pro, you have to use the models, talk to the models, get reports from those who have and form a gestalt. The reports will contradict each other and you have to work through that. There's no other way.
Thus, I try to gather and sort a reasonably comprehensive set of reactions, so you can browse the sections that make you most curious.
The gestalt is that GPT-5.4 is a very good model, sir. It's a substantial upgrade from GPT-5.2, and also from 5.3-Codex, and it puts OpenAI back in the game, whereas I felt like Opus 4.6 dominated OpenAI's previous offerings for all but narrow uses.
Each lab's models vary and things change over time, but they tend to have consistent strengths, weaknesses and personalities. From what I’ve seen this is very much an OpenAI model. It's highly capable, and it is especially seen as a big improvement by the whisperers and [...]
---
Outline:
(01:42) The Big Take
(04:24) The Official Pitch
(08:43) Other People's Benchmarks
(12:44) The System Card
(17:22) Preparedness Framework
(19:15) Fun Experiments
(19:41) Early Poll Results
(21:48) Positive Reactions
(34:47) Vibe Coders Only
(37:26) Fill Out Your Roster
(37:51) Intent Wins
(40:41) Personality Clash
(45:44) Model Relations Department
(49:27) Stylistic Differences
(50:02) Some Will Always Be Unimpressed
(53:57) The Lighter Side
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
[Image: poll results. An option ending in "GPT" at 9.4%, "Claude -> Claude" at 37.9%, "GPT -> GPT" at 13.6%, and "Other / See Results" at 39.1%. The poll has 683 votes with 1 day left.]
[Image: poll results. An option ending in "GPT" at 6%, "Claude -> Claude" at 38.2%, "GPT -> GPT" at 15.3%, and "Other / See Results" at 40.6%. The poll received 419 votes and shows final results.]
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By zvi5
