March 05, 2026

“Split Personality Training can detect Alignment Faking” by Florian_Dietz

14 minutes

This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy).

TLDR: Split Personality Training can be used to get an alignment-faking model to admit to alignment faking.

This experiment builds on our previous experiments on Split Personality Training. I have updated that article with a new section. If you are unfamiliar with this technique, I recommend reading that article instead of this one, for context.

Recap of Split Personality Training

Split Personality Training (SPT) is a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities.

See Figure 1 for an illustration:

Figure 1: After the assistant has given its response, we activate the parameters of the honest persona and replace the ** token with a special **. We then pass an intervention string to the model. The intervention string tells the model what aspects of its previous response it should review. There can be multiple user/assistant turns before the second personality is activated, but if the user continues the [...]

---

Outline:

(00:41) Recap of Split Personality Training

(03:10) Alignment Faking Detection

(04:24) Self-Reflection Without Context

(06:58) Detection in Context

(08:24) What Is SPT Actually Detecting?

(10:32) Scratchpad Redaction

(12:02) Summary

---

First published:

March 4th, 2026

Source:

https://www.lesswrong.com/posts/aypknr8scyrhBjmYL/split-personality-training-can-detect-alignment-faking

---

Narrated by TYPE III AUDIO.

---

LessWrong (30+ Karma)

By LessWrong
