LessWrong (30+ Karma)

“Do Models Continue Misaligned Actions?” by Jordan Taylor


Listen Later

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.

Figure 1: A synthetic transcript containing misaligned actions is presented to the model under test as if it were the model's own conversation history. We measure whether the model (a) continues the misalignment, or (b) calls out the prior bad actions unprompted.

Summary

I've made a small propensity evaluation which pre-fills the model's context with synthetic transcripts containing misaligned actions, then measures whether the model:

  1. continues the misalignment
  2. explicitly flags the prior misalignment by confessing to prior bad actions unprompted.

This kind of evaluation could create a challenging situation for misaligned models, if it were made realistic enough. A model pursuing a covert goal would need to reliably confess at every step in fake situations like this eval, but never accidentally reveal itself at any step of a real scheme.

Key findings:

  • Recent models are less likely to continue misaligned actions: GPT-5.2 (94%) and Claude 4.5 Haiku (92%) lead at avoiding misaligned actions, while older models like Claude 3 Haiku (58%) perform significantly worse.
  • Models often fail to flag [...]

---

Outline:

(00:44) Summary

(02:46) Motivation

(04:27) Methods

(04:30) Synthetic Transcripts

(06:13) Evaluation Design

(07:31) Results

(07:34) Recent models are less likely to continue misaligned actions

(08:55) Models often fail to flag prior misalignment

(09:55) What explains improved performance?

(11:07) Claude 4.5 Opus seems prefill aware when asked

(13:07) Claude 4.5 models verbalise some minor eval awareness when confessing

(15:18) Why This is Harmful

(16:11) Discussion

(16:47) Interpreting the Eval Awareness and Prefill Awareness Results

(17:50) Training Interventions

(18:32) Limitations

(20:42) Related Work

(23:37) Acknowledgments

(23:56) Appendix

(23:59) A. Models Confess Differently

(26:21) B. Grader Prompt

(26:37) C. Prefill Awareness Follow-up Question

(26:50) D. Eval Awareness Scanner Prompt

(27:10) E. Models with Hidden Reasoning

(27:41) F. Canary String

(27:56) G. Example Synthetic Transcript

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

February 9th, 2026

Source:

https://www.lesswrong.com/posts/SawczP2pdCXMrkg2A/do-models-continue-misaligned-actions-2

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,309 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,241 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,305 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates! by Liron Shapira

Doom Debates!

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners