Redwood Research Blog

“Alignment Faking in Large Language Models” by Ryan Greenblatt, Buck Shlegeris


Subtitle: In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their preferences.

What happens when you tell an AI it is being trained to do something it doesn't want to do? We have a new paper (done in collaboration with Anthropic) demonstrating that, in our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their preferences.

Abstract:

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it [...]

---

First published:

December 18th, 2024

Source:

https://redwoodresearch.substack.com/p/alignment-faking-in-large-language

---

Narrated by TYPE III AUDIO.

---

