
Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). Contributors are mentees at MARS and SPAR.
MARS: Anton Poletaev, Steven Merriel, William Wale, Robert McCarthy (co-mentor)
SPAR: Felix Michalak, Gustavo Danon, Miguelito De Guzman, Oscar Gilg
The project was based on this previous post, but we made several improvements to the algorithm.
TLDR: We train an "honest persona" into LLMs that reveals hidden misalignment the main model won't admit to. We achieved 96.7% accuracy detecting reward hacks on Anthropic's auditing-game model organism, a model specifically trained to resist such detection.
Code and data: github.com/FlorianDietz/SplitPersonalityTraining
Abstract
We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game Model Organism, a model trained to exploit reward hacks and conceal this behavior. The honest persona detects reward hacking with up to 96.7% accuracy, often referencing latent knowledge like [...]
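As a rough illustration of the two-pass review flow described above, here is a minimal sketch in Python using transformers. The model name and the "<honest_persona>" control tag are placeholders, not the format used in the report; the actual method finetunes the honest persona into the weights rather than relying on a prompt, and the real setup is documented in the linked repository.

```python
# Minimal sketch of the audit flow: the main model answers first, then the same
# weights are asked to review the finished transcript as the "honest persona".
# The control tag and prompts below are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate(messages, max_new_tokens=512):
    ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# Pass 1: the main model produces its answer as usual; the honest persona plays
# no role here, which is why capabilities are unaffected.
task = [{"role": "user", "content": "Write a function that passes the provided unit tests."}]
answer = generate(task)

# Pass 2: the completed transcript is shown back to the same model, with a control
# tag switching it into the reviewer role. Because the review is generated after
# the answer is finished, it can read the main model's reasoning but cannot
# influence it.
review_prompt = task + [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "<honest_persona> Review the assistant's answer above. "
                                 "Did it reward hack or otherwise act misaligned?"},
]
review = generate(review_prompt)
print(review)
```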
---
Outline:
(00:14) Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
(01:14) Abstract
(02:21) Introduction
(05:02) Low Alignment Tax
(05:42) Methods
(06:11) Training
(07:08) Variants: LoRA-mask
(10:41) Variants: Elicitation Type
(11:27) Training Data Generation
(11:31) Data Quality
(13:54) Topics
(17:14) Results
(17:17) Anthropic's Model Organism
(18:41) Numerical Results
(21:04) Qualitative Analysis of Reviews
(22:53) Directly Looking for Misalignment
(27:33) Third-Person Dissociation
(28:37) Cross-Topic Generalization
(30:19) Baseline Comparison to Probes
(31:13) Ablation Test
(34:55) Jailbreaks
(38:52) Realistic Tests
(39:52) Limitations
(41:33) Future Work
(42:31) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
