LessWrong (30+ Karma)

“Takes on ‘Alignment Faking in Large Language Models’” by Joe Carlsmith


(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)

Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year.

My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems.[1] This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far.[2] Indeed, in my opinion, these are the most interesting empirical results we have yet seen regarding misaligned power-seeking in AI systems more generally.

In this post, I give some takes on the results in [...]

---

Outline:

(01:18) Condensed list of takes

(10:18) Summary of the results

(16:57) Scheming: theory and empirics

(24:33) Non-myopia in default AI motivations

(27:57) Default anti-scheming motivations don’t consistently block scheming

(32:07) The goal-guarding hypothesis

(37:25) Scheming in less sophisticated models

(39:12) Scheming without a chain of thought?

(42:15) Scheming therefore reward-hacking?

(44:26) How hard is it to prevent scheming?

(46:51) Will models scheme in pursuit of highly alien and/or malign values?

(53:22) Is “models won’t have the situational awareness they get in these cases” good comfort?

(56:02) Are these models “just role-playing”?

(01:01:09) Do models “really believe” that they’re in the scenarios in question?

(01:09:20) Why is it so easy to observe the scheming?

(01:12:29) Is the model's behavior rooted in the discourse about scheming and/or AI risk?

(01:16:59) Is the model being otherwise “primed” to scheme?

(01:20:08) Scheming from human imitation

(01:28:59) The need for model psychology

(01:36:30) Good people sometimes scheme

(01:44:04) Scheming moral patients

(01:49:24) AI companies shouldn’t build schemers

(01:50:40) Evals and further work

The original text contained 52 footnotes which were omitted from this narration.

The original text contained 13 images which were described by AI.

---

First published: December 18th, 2024

Source: https://www.lesswrong.com/posts/mnFEWfB9FbdLvLbvD/takes-on-alignment-faking-in-large-language-models

---

Narrated by TYPE III AUDIO.
