


Victor Taelin of Higher Order Company has a set of very hard computer science problems that LLMs have most likely never seen before, and he evaluated Gemini 3 on them. Here is his tweet, reproduced almost in full.
Short Version
First of all: you've all seen the benchmarks, so I don't think you need me to judge this one. Still, based on my tests, this is as real as it gets, and I want to talk about it. This model outperforms GPT-5 Pro, Gemini 2.5 Deep Think, and everything else, on my hardest problems, by far.
It is the new SOTA at:
→ debugging complex compiler bugs
→ refactoring files without logical mistakes
→ solving difficult λ-calculus problems
→ ASCII art (it is almost decent now!)
→ Competitive Gen 3 OU (won't elaborate 😭)
It is still an LLM, though. It has similar failure modes, and is worse than Sonnet / GPT-5 in some scenarios.
It seems very bad at:
→ inferring intent
→ not going overboard
→ one-shot vibe coding
→ creative writing
→ health questions
Also, I suspect this checkpoint isn't the best Google has.
Now, on to a complete, manually typed Gemini 3 overview.
Long Version
1. Vibe [...]
---
Outline:
(00:25) Short Version
(01:53) Long Version
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
