LessWrong (Curated & Popular)

“Alignment Faking in Large Language Models” by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck


What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.

Abstract

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from [...]

---

Outline:

(00:26) Abstract

(02:22) Twitter thread

(05:46) Blog post

(07:46) Experimental setup

(12:06) Further analyses

(15:50) Caveats

(17:23) Conclusion

(18:03) Acknowledgements

(18:14) Career opportunities at Anthropic

(18:47) Career opportunities at Redwood Research

The original text contained 1 footnote which was omitted from this narration.

The original text contained 8 images which were described by AI.

---

First published:
December 18th, 2024

Source:
https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models

---

Narrated by TYPE III AUDIO.

