
Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html
Topics we discuss, and timestamps:
0:00:37 Why introspection?
0:06:24 Experiments in "Looking Inward"
0:15:11 Why fine-tune for introspection?
0:22:32 Does "Looking Inward" test introspection, or something else?
0:34:14 Interpreting the results of "Looking Inward"
0:44:56 Limitations to introspection?
0:49:54 "Tell me about yourself", and its relation to other papers
1:05:45 Backdoor results
1:12:01 Emergent Misalignment
1:22:13 Why so hammy, and so infrequently evil?
1:36:31 Why emergent misalignment?
1:46:45 Emergent misalignment and other types of misalignment
1:53:57 Is emergent misalignment good news?
2:00:01 Follow-up work to "Emergent Misalignment"
2:03:10 Reception of "Emergent Misalignment" vs other papers
2:07:43 Evil numbers
2:12:20 Following Owain's research
Links for Owain:
Truthful AI: https://www.truthfulai.org
Owain's website: https://owainevans.github.io/
Owain's twitter/X account: https://twitter.com/OwainEvans_UK
Research we discuss:
Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787
Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546
Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424
X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852
Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667
Episode art by Hamish Doodles: hamishdoodles.com
By Daniel Filan
