LessWrong (30+ Karma)

“Takes on ‘Alignment Faking in Large Language Models’” by Joe Carlsmith


(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)

Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year.

My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems.[1] This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far.[2] Indeed, in my opinion, these are the most interesting empirical results we have yet seen regarding misaligned power-seeking in AI systems more generally.

In this post, I give some takes on the results in [...]

---

Outline:

(01:18) Condensed list of takes

(10:18) Summary of the results

(16:57) Scheming: theory and empirics

(24:33) Non-myopia in default AI motivations

(27:57) Default anti-scheming motivations don’t consistently block scheming

(32:07) The goal-guarding hypothesis

(37:25) Scheming in less sophisticated models

(39:12) Scheming without a chain of thought?

(42:15) Scheming therefore reward-hacking?

(44:26) How hard is it to prevent scheming?

(46:51) Will models scheme in pursuit of highly alien and/or malign values?

(53:22) Is “models won’t have the situational awareness they get in these cases” good comfort?

(56:02) Are these models “just role-playing”?

(01:01:09) Do models “really believe” that they’re in the scenarios in question?

(01:09:20) Why is it so easy to observe the scheming?

(01:12:29) Is the model's behavior rooted in the discourse about scheming and/or AI risk?

(01:16:59) Is the model being otherwise “primed” to scheme?

(01:20:08) Scheming from human imitation

(01:28:59) The need for model psychology

(01:36:30) Good people sometimes scheme

(01:44:04) Scheming moral patients

(01:49:24) AI companies shouldn’t build schemers

(01:50:40) Evals and further work

The original text contained 52 footnotes which were omitted from this narration.

The original text contained 13 images which were described by AI.

---

First published: December 18th, 2024

Source: https://www.lesswrong.com/posts/mnFEWfB9FbdLvLbvD/takes-on-alignment-faking-in-large-language-models

---

Narrated by TYPE III AUDIO.
