(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)
Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year.
My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems.[1] This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far.[2] Indeed, in my opinion, these are the most interesting empirical results we have yet seen regarding misaligned power-seeking in AI systems more generally.
In this post, I give some takes on the results in [...]
---
Outline:
(01:18) Condensed list of takes
(10:18) Summary of the results
(16:57) Scheming: theory and empirics
(24:33) Non-myopia in default AI motivations
(27:57) Default anti-scheming motivations don’t consistently block scheming
(32:07) The goal-guarding hypothesis
(37:25) Scheming in less sophisticated models
(39:12) Scheming without a chain of thought?
(42:15) Scheming therefore reward-hacking?
(44:26) How hard is it to prevent scheming?
(46:51) Will models scheme in pursuit of highly alien and/or malign values?
(53:22) Is “models won’t have the situational awareness they get in these cases” good comfort?
(56:02) Are these models “just role-playing”?
(01:01:09) Do models “really believe” that they’re in the scenarios in question?
(01:09:20) Why is it so easy to observe the scheming?
(01:12:29) Is the model’s behavior rooted in the discourse about scheming and/or AI risk?
(01:16:59) Is the model being otherwise “primed” to scheme?
(01:20:08) Scheming from human imitation
(01:28:59) The need for model psychology
(01:36:30) Good people sometimes scheme
(01:44:04) Scheming moral patients
(01:49:24) AI companies shouldn’t build schemers
(01:50:40) Evals and further work
---