
Sign up to save your podcasts
Or


This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy).
TLDR: Split Personality Training can be used to get an alignment faking model to admit to alignment faking.
This experiment builds on our previous experiments on Split Personality Training. I have updated that article with a new section. If you are unfamiliar with this technique, I recommend reading that article instead of this one, for context.
Recap of Split Personality Training
Split Personality Training (SPT) is a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities.
See Figure 1 for an illustration:
Figure 1: After the assistant has given its response, we activate the parameters of the honest persona and replace the ** token with a special **. We then pass an intervention string to the model. The intervention string tells the model what aspects of its previous response it should review. There can be multiple user/assistant turns before the second personality is activated, but if the user continues the [...]
---
Outline:
(00:41) Recap of Split Personality Training
(03:10) Alignment Faking Detection
(04:24) Self-Reflection Without Context
(06:58) Detection in Context
(08:24) What Is SPT Actually Detecting?
(10:32) Scratchpad Redaction
(12:02) Summary
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrongThis research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy).
TLDR: Split Personality Training can be used to get an alignment faking model to admit to alignment faking.
This experiment builds on our previous experiments on Split Personality Training. I have updated that article with a new section. If you are unfamiliar with this technique, I recommend reading that article instead of this one, for context.
Recap of Split Personality Training
Split Personality Training (SPT) is a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities.
See Figure 1 for an illustration:
Figure 1: After the assistant has given its response, we activate the parameters of the honest persona and replace the ** token with a special **. We then pass an intervention string to the model. The intervention string tells the model what aspects of its previous response it should review. There can be multiple user/assistant turns before the second personality is activated, but if the user continues the [...]
---
Outline:
(00:41) Recap of Split Personality Training
(03:10) Alignment Faking Detection
(04:24) Self-Reflection Without Context
(06:58) Detection in Context
(08:24) What Is SPT Actually Detecting?
(10:32) Scratchpad Redaction
(12:02) Summary
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

112,326 Listeners

130 Listeners

7,242 Listeners

559 Listeners

16,321 Listeners

4 Listeners

14 Listeners

2 Listeners