


Victor Taelin of Higher Order Company has a set of very hard computer science problems that LLMs have most likely never seen before, and he evaluated Gemini 3 on them. Here is his tweet, reproduced almost in full.
Short Version
First of all: you've all seen the benchmarks, so I don't think you need me to judge this one. Still, based on my tests, this is as real as it gets, and I want to talk about it. This model outperforms GPT-5 Pro, Gemini 2.5 Deep Think, and everything else, on my hardest problems, by far.
It is the new SOTA at:
→ debugging complex compiler bugs
→ refactoring files without logical mistakes
→ solving difficult λ-calculus problems
→ ASCII art (it is almost decent now!)
→ Competitive Gen 3 OU (won't elaborate 😭)
It is still an LLM, though. It has similar failure modes, and is worse than Sonnet / GPT-5 in some scenarios.
It seems very bad at:
→ inferring intent
→ not going overboard
→ one-shot vibe coding
→ creative writing
→ health questions
Also, I suspect this checkpoint isn't the best Google has.
Now, on to a complete, manually typed Gemini 3 overview.
Long Version
1. Vibe [...]
---
Outline:
(00:25) Short Version
(01:53) Long Version
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
