Disclaimer: this post was not written by me, but by a friend who wishes to remain anonymous. I did some editing, however.
So a recent post my friend wrote made the point quite clearly (I hope) that LLM performance on the simple task of playing and winning a game of Pokémon Red is highly dependent on the scaffold and tooling provided. In a way, this is not surprising: the scaffold exists to address limitations in what the model can do, and to paper over things like the lack of long-term context, executive function, and so on.
But the thing is, I thought I knew that, and then I actually tried to run Pokémon Red.
A Casual Research Narrative
The underlying code is the basic Claude scaffold provided by David Hershey of Anthropic.[1] I first simply let Claude 3.7 run on it for a bit, making observations about what I thought might generate [...]
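To make concrete what "scaffold" means here, below is a minimal sketch of the kind of agent loop such a scaffold runs. This is not Hershey's actual code: `get_screenshot` and `press_button` are hypothetical stand-ins for the emulator bridge, and the prompt and fallback logic are my own assumptions. Only the Anthropic Messages API call and the model identifier reflect real interfaces.

```python
import base64
import io

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

VALID_BUTTONS = {"a", "b", "up", "down", "left", "right", "start", "select"}


def get_screenshot():
    """Hypothetical emulator bridge: return the current frame as a PIL image."""
    raise NotImplementedError("wire this to your Game Boy emulator")


def press_button(button: str) -> None:
    """Hypothetical emulator bridge: press one Game Boy button."""
    raise NotImplementedError("wire this to your Game Boy emulator")


def encode_screenshot(image) -> str:
    """Encode a PIL image as base64 PNG for the vision API."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")


def choose_button(image, history: list[str]) -> str:
    """Ask the model which button to press next, given the current frame."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": encode_screenshot(image)}},
                {"type": "text",
                 "text": "You are playing Pokemon Red. Recent buttons: "
                         f"{history[-10:]}. Reply with exactly one button: "
                         "a, b, up, down, left, right, start, or select."},
            ],
        }],
    )
    choice = response.content[0].text.strip().lower()
    return choice if choice in VALID_BUTTONS else "a"  # fall back on nonsense


# Main loop: screenshot -> model -> button press, forever.
history: list[str] = []
while True:
    frame = get_screenshot()
    button = choose_button(frame, history)
    history.append(button)
    press_button(button)
```

A real scaffold layers much more on top of this bare loop: memory files, navigation tools, annotated screenshots, and so on, which is exactly the tooling the results turn out to depend on.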
---
Outline:
A Casual Research Narrative
An only somewhat sorted list of observations about all this
Model Vision of Pokémon Red is Bad. Really Bad.
Models Can't Remember
Spatial Reasoning? What's that?
A Grasp on Reality
Costs
Why do this at all?
So which model is better?
Miscellanea: ClaudePlaysPokemon Derp Anecdotes
---