March 11, 2026

“GPT-5.4 Is A Substantial Upgrade” by Zvi

54 minutes

Benchmarks have never been less useful for telling us which models are best.

They are good for giving a general sense of the landscape. They definitely paint a picture. But if you’re comparing top models, like GPT-5.4 against Opus 4.6 against Gemini 3.1 Pro, you have to use the models, talk to the models, get reports from those who have and form a gestalt. The reports will contradict each other and you have to work through that. There's no other way.

Thus, I try to gather and sort a reasonably comprehensive set of reactions, so you can browse the sections that make you most curious.

The gestalt is that GPT-5.4 is a very good model, sir. It's a substantial upgrade from GPT-5.2, and also from 5.3-Codex, and it puts OpenAI back in the game, whereas I felt like Opus 4.6 dominated OpenAI's previous offerings for all but narrow uses.

Each lab's models vary and things change over time, but they tend to have consistent strengths, weaknesses and personalities. From what I’ve seen this is very much an OpenAI model. It's highly capable, and it is especially seen as a big improvement by the whisperers and [...]

---

Outline:

(01:42) The Big Take

(04:24) The Official Pitch

(08:43) Other People's Benchmarks

(12:44) The System Card

(17:22) Preparedness Framework

(19:15) Fun Experiments

(19:41) Early Poll Results

(21:48) Positive Reactions

(34:47) Vibe Coders Only

(37:26) Fill Out Your Roster

(37:51) Intent Wins

(40:41) Personality Clash

(45:44) Model Relations Department

(49:27) Stylistic Differences

(50:02) Some Will Always Be Unimpressed

(53:57) The Lighter Side

---

First published:

March 11th, 2026

Source:

https://www.lesswrong.com/posts/sKCYLEN5EYLuokDft/gpt-5-4-is-a-substantial-upgrade

---

Narrated by TYPE III AUDIO.

---

Images from the article:

[Image: poll results showing "[...] GPT" at 9.4%, "Claude -> Claude" at 37.9%, "GPT -> GPT" at 13.6%, and "Other / See Results" at 39.1%. The poll has 683 votes with 1 day left.]

[Image: poll results showing "[...] GPT" at 6%, "Claude -> Claude" at 38.2%, "GPT -> GPT" at 15.3%, and "Other / See Results" at 40.6%. The poll received 419 votes and shows final results.]

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
