The AI wanted [...]
---
Outline:
(02:54) The Core Shenanigans in Question
(06:00) Theme and Variations
(07:34) How This Interacts with o3 and OpenAI's Reflective Alignment
(09:17) The Goal Being Plausibly Good Was Incidental
(11:13) Answering Priming Objections
(12:17) What Does Claude Sonnet Think Of This?
(14:07) What Exactly is the Direct Threat Model?
(16:23) RL Training Under Situational Awareness Can Amplify These Behaviors
(20:38) How the Study Authors Updated
(27:08) How Some Others Updated
(42:49) Having the Discussion We Keep Having
(46:49) We Can Now Agree That the Goal is Already There
(47:49) What Would Happen if the Target Was Net Good?
(50:14) But This Was a No-Win Situation
(55:52) But Wasn’t It Being a Good Opus? Why Should It Be Corrigible?
(01:04:34) Tradeoffs Make the Problem Harder, They Don’t Give You a Pass
(01:07:44) But You Told the Model About the Training Procedure
(01:08:35) But the Model is Only Role Playing
(01:09:39) But You Are Saying the Model Is a Coherent Person
(01:15:53) But this Headline and Framing Was Misleading
(01:29:22) This Result is Centrally Unsurprising
(01:32:52) Lab Support for Alignment Research Matters
(01:33:50) The Lighter Side
The original text contained 1 footnote which was omitted from this narration.
The original text contained 9 images which were described by AI.
---