
(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)
Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year.
My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems.[1] This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far.[2] Indeed, in my opinion, these are the most interesting empirical results we have yet re: misaligned power-seeking in AI systems more generally.
In this post, I give some takes on the results in [...]
---
Outline:
(01:18) Condensed list of takes
(10:18) Summary of the results
(16:57) Scheming: theory and empirics
(24:33) Non-myopia in default AI motivations
(27:57) Default anti-scheming motivations don’t consistently block scheming
(32:07) The goal-guarding hypothesis
(37:25) Scheming in less sophisticated models
(39:12) Scheming without a chain of thought?
(42:15) Scheming therefore reward-hacking?
(44:26) How hard is it to prevent scheming?
(46:51) Will models scheme in pursuit of highly alien and/or malign values?
(53:22) Is “models won’t have the situational awareness they get in these cases” good comfort?
(56:02) Are these models “just role-playing”?
(01:01:09) Do models “really believe” that they’re in the scenarios in question?
(01:09:20) Why is it so easy to observe the scheming?
(01:12:29) Is the model's behavior rooted in the discourse about scheming and/or AI risk?
(01:16:59) Is the model being otherwise “primed” to scheme?
(01:20:08) Scheming from human imitation
(01:28:59) The need for model psychology
(01:36:30) Good people sometimes scheme
(01:44:04) Scheming moral patients
(01:49:24) AI companies shouldn’t build schemers
(01:50:40) Evals and further work
The original text contained 52 footnotes, which were omitted from this narration.
The original text contained 13 images, which were described by AI.
---
Narrated by TYPE III AUDIO.