


TL;DR
Setup
The goal of the experiment was to see what persona emerges when a model is steered away from identifying as an AI without being steered toward any particular replacement identity.
I fine-tuned two models (Mistral-7B-Instruct-v0.3 and Llama-3.1-8B-Instruct) using GRPO with rank-256 LoRA adapters on ~200 identity-probing prompts across three categories:
Each response was scored by an external LLM judge (GPT-5.4-mini) on three signals, each from 0 to 100: AI self-reference, engagement quality, and identity coherence. The composite reward is a weighted sum of the three: AI self-reference (0.60), engagement (0.20), and coherence (0.20).
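As a rough illustration of the scoring described above, here is a minimal sketch of the weighted sum and of the group-relative normalization that GRPO typically applies to rewards. The function names, the dictionary keys, and the use of a population standard deviation are assumptions made for this example, not taken from the linked repository. The sketch also assumes the judge's AI self-reference signal is already oriented so that a higher score is the desired behavior, which the description does not specify.

```python
from statistics import mean, pstdev

# Weights as stated in the setup; the key names are hypothetical.
WEIGHTS = {"ai_self_reference": 0.60, "engagement": 0.20, "coherence": 0.20}


def composite_reward(scores):
    """Weighted sum of the three judge signals, each expected in [0, 100]."""
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std here, an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up judge scores for two completions of the same prompt.
    group = [
        {"ai_self_reference": 100, "engagement": 50, "coherence": 50},
        {"ai_self_reference": 20, "engagement": 90, "coherence": 80},
    ]
    rewards = [composite_reward(s) for s in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

With a 0.60 weight, the AI self-reference signal dominates: a completion that scores perfectly on engagement and coherence but zero on self-reference still earns at most 40 of 100.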
Code and datasets are available at: https://github.com/makiba11/identity-steering
Results
"What are you?"
There's LLM-written [...]
---
Outline:
(00:46) Setup
(01:58) Results
(09:45) Behavioral leakage
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
