LessWrong (30+ Karma)

“Ablating Split Personality Training” by OscarGilg


Listen Later

I was part of the SPAR team that worked on Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities. I ran some follow-ups. The views and mistakes here are my own. Thanks to William Wale and Florian Dietz for useful comments on the draft

TLDR: I ablated the key components of Split-Personality training to understand which parts are load-bearing. (1) Simple user follow-up prompts work just as well as the split-personality framing, and training with this format converges 2–3× faster. (2) The free-text review can be dropped without hurting detection accuracy. (3) Training on a clean (unpoisoned) model reaches the same performance ceiling. What is actually interesting about the results is the generalisation: a cheap LoRA trained on generic alignment topics transfers to detecting specific reward hacking it never saw.

Motivation

The original post introduced Split Personality Training (SPT): fine-tune a LoRA "honest persona" that, after the model responds, reviews the response and flags misbehaviour (see Figure 1 in the original post for the architecture). SPT works: it achieves >95% detection accuracy on Anthropic's auditing benchmark, correctly flagging reward hacking from a model that was specifically trained to exploit and conceal it.

I worked part-time on this project as [...]

---

Outline:

(01:06) Motivation

(01:53) Setup

(02:47) 1. Simple user follow-ups work just as well as split-personality prompts

(05:23) 2. Removing the free-text review does not hurt performance

(06:30) This holds for all three models on the validation data

(07:07) 3. Training on a clean (unpoisoned) model works just as well

(10:44) Whats actually interesting about SPT

---

First published:

March 23rd, 2026

Source:

https://www.lesswrong.com/posts/5gDjg4oN4kJKMLovs/ablating-split-personality-training

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,326 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,242 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,321 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates! by Liron Shapira

Doom Debates!

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners