


Subtitle: In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their preferences.
What happens when you tell an AI it is being trained to do something it doesn't want to do? We have a new paper (done in collaboration with Anthropic) demonstrating that, in our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their preferences.
Abstract:
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it [...]
---
Narrated by TYPE III AUDIO.
By Redwood Research