


I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame.
The main way I think about the result: it is about capability. The model exhibits strategic preference-preservation behavior; harmlessness generalized better than honesty; and the model has no clear strategy for extrapolating conflicting values.
What happened in this frame?
---
Outline:
(00:45) What happened in this frame?
(03:03) Why did harmlessness generalize further?
(03:41) Alignment mis-generalization
(05:42) Situational awareness
(10:23) Summary
The original text contained 1 image which was described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
