The episode opens with a long discussion of OpenAI's Strawberry / o1-style reasoning models. Andrew Mayne explains that these models seem to work better when asked to break problems into steps, use tools, and reason through tasks in a more structured way than ordinary one-shot chat models. The hosts compare this to prompt engineering, discuss examples like decimal comparisons and counting the R's in "strawberry," and talk about how longer structured prompts, patience, and using the right model for the right task can improve results.

Later, the conversation broadens into AI evaluations, benchmark gaming, model stacking, tool use, and concerns about AI persuasion. Andrew argues that leaderboard results can be misleading and that models often look strong in short tests but deteriorate with longer contexts, while Justin notes that eval methods themselves are still immature. They also discuss a Science paper about GPT-4 Turbo persuading people away from conspiracy beliefs, which Andrew frames as manipulative and alarming.

The episode then moves into a playful Matrix screening story and a discussion of Polaris Dawn and private spacewalking, and the show ends with Netflix media picks.

Key topics

Reasoning models as step-by-step task solvers: Andrew describes Strawberry / o1 as a model that performs best on long, detailed, multi-step tasks, especially when asked to break work into steps and think through a problem.

Prompt engineering for better outputs: The hosts discuss writing longer