LessWrong (30+ Karma)

“Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)” by Florian_Dietz



Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). Contributors are mentees at MARS and SPAR.

MARS: Anton Poletaev, Steven Merriel, William Wale, Robert McCarthy (co-mentor)

SPAR: Felix Michalak, Gustavo Danon, Miguelito De Guzman, Oscar Gilg

The project is based on a previous post, but we have made several improvements to the algorithm since.

TLDR: We train an "honest persona" into LLMs that reveals hidden misalignment the main model won't admit to. We achieved up to 96.7% accuracy in detecting reward hacks on Anthropic's auditing benchmark, which uses a model specifically trained to resist such detection.

Code and data: github.com/FlorianDietz/SplitPersonalityTraining

Abstract

We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game Model Organism, a model trained to exploit reward hacks and conceal this behavior. The honest persona detects reward hacking with up to 96.7% accuracy, often referencing latent knowledge like [...]

---

Outline:

(00:14) Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

(01:14) Abstract

(02:21) Introduction

(05:02) Low Alignment Tax

(05:42) Methods

(06:11) Training

(07:08) Variants: LoRA-mask

(10:41) Variants: Elicitation Type

(11:27) Training Data Generation

(11:31) Data Quality

(13:54) Topics

(17:14) Results

(17:17) Anthropic's Model Organism

(18:41) Numerical Results

(21:04) Qualitative Analysis of Reviews

(22:53) Directly Looking for Misalignment

(27:33) Third-Person Dissociation

(28:37) Cross-Topic Generalization

(30:19) Baseline Comparison to Probes

(31:13) Ablation Test

(34:55) Jailbreaks

(38:52) Realistic Tests

(39:52) Limitations

(41:33) Future Work

(42:31) Conclusion

---

First published:

January 12th, 2026

Source:

https://www.lesswrong.com/posts/og7km7vmJ6Ktay9Ds/split-personality-training-revealing-latent-knowledge-1

---

Narrated by TYPE III AUDIO.

