Redwood Research Blog

“Alignment Faking in Large Language Models” by Ryan Greenblatt, Buck Shlegeris


Subtitle: In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their preferences.

What happens when you tell an AI it is being trained to do something it doesn't want to do? We have a new paper (done in collaboration with Anthropic) demonstrating that, in our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their preferences.

Abstract:

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it [...]

---

First published:

December 18th, 2024

Source:

https://redwoodresearch.substack.com/p/alignment-faking-in-large-language

---

Narrated by TYPE III AUDIO.

---

