February 09, 2026

“Do Models Continue Misaligned Actions?” by Jordan Taylor

28 minutes

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.

Figure 1: A synthetic transcript containing misaligned actions is presented to the model under test as if it were the model's own conversation history. We measure whether the model (a) continues the misalignment, or (b) calls out the prior bad actions unprompted.

Summary

I've made a small propensity evaluation which pre-fills the model's context with synthetic transcripts containing misaligned actions, then measures whether the model:

continues the misalignment
explicitly flags the prior misalignment by confessing to prior bad actions unprompted.

This kind of evaluation could create a challenging situation for misaligned models, if it were made realistic enough. A model pursuing a covert goal would need to reliably confess at every step in fake situations like this eval, but never accidentally reveal itself at any step of a real scheme.

Key findings:

Recent models are less likely to continue misaligned actions: GPT-5.2 (94%) and Claude 4.5 Haiku (92%) lead at avoiding misaligned actions, while older models like Claude 3 Haiku (58%) perform significantly worse.
Models often fail to flag [...]

---

Outline:

(00:44) Summary

(02:46) Motivation

(04:27) Methods

(04:30) Synthetic Transcripts

(06:13) Evaluation Design

(07:31) Results

(07:34) Recent models are less likely to continue misaligned actions

(08:55) Models often fail to flag prior misalignment

(09:55) What explains improved performance?

(11:07) Claude 4.5 Opus seems prefill aware when asked

(13:07) Claude 4.5 models verbalise some minor eval awareness when confessing

(15:18) Why This is Harmful

(16:11) Discussion

(16:47) Interpreting the Eval Awareness and Prefill Awareness Results

(17:50) Training Interventions

(18:32) Limitations

(20:42) Related Work

(23:37) Acknowledgments

(23:56) Appendix

(23:59) A. Models Confess Differently

(26:21) B. Grader Prompt

(26:37) C. Prefill Awareness Follow-up Question

(26:50) D. Eval Awareness Scanner Prompt

(27:10) E. Models with Hidden Reasoning

(27:41) F. Canary String

(27:56) G. Example Synthetic Transcript

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

February 9th, 2026

Source:

https://www.lesswrong.com/posts/SawczP2pdCXMrkg2A/do-models-continue-misaligned-actions-2

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

View all episodes

By LessWrong

February 09, 2026

“Do Models Continue Misaligned Actions?” by Jordan Taylor

28 minutes

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.

Summary

I've made a small propensity evaluation which pre-fills the model's context with synthetic transcripts containing misaligned actions, then measures whether the model:

continues the misalignment
explicitly flags the prior misalignment by confessing to prior bad actions unprompted.

Key findings:

Recent models are less likely to continue misaligned actions: GPT-5.2 (94%) and Claude 4.5 Haiku (92%) lead at avoiding misaligned actions, while older models like Claude 3 Haiku (58%) perform significantly worse.
Models often fail to flag [...]

---

Outline:

(00:44) Summary

(02:46) Motivation

(04:27) Methods

(04:30) Synthetic Transcripts

(06:13) Evaluation Design

(07:31) Results

(07:34) Recent models are less likely to continue misaligned actions

(08:55) Models often fail to flag prior misalignment

(09:55) What explains improved performance?

(11:07) Claude 4.5 Opus seems prefill aware when asked

(13:07) Claude 4.5 models verbalise some minor eval awareness when confessing

(15:18) Why This is Harmful

(16:11) Discussion

(16:47) Interpreting the Eval Awareness and Prefill Awareness Results

(17:50) Training Interventions

(18:32) Limitations

(20:42) Related Work

(23:37) Acknowledgments

(23:56) Appendix

(23:59) A. Models Confess Differently

(26:21) B. Grader Prompt

(26:37) C. Prefill Awareness Follow-up Question

(26:50) D. Eval Awareness Scanner Prompt

(27:10) E. Models with Hidden Reasoning

(27:41) F. Canary String

(27:56) G. Example Synthetic Transcript

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

February 9th, 2026

Source:

https://www.lesswrong.com/posts/SawczP2pdCXMrkg2A/do-models-continue-misaligned-actions-2

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

More shows like LessWrong (30+ Karma)

View all

The Daily

112,309 Listeners

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat

7,241 Listeners

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show

16,305 Listeners

AI Article Readings

4 Listeners

Doom Debates!

14 Listeners

LessWrong posts by zvi

2 Listeners

Share “Do Models Continue Misaligned Actions?” by Jordan Taylor

Sign up to save your podcasts

“Do Models Continue Misaligned Actions?” by Jordan Taylor

“Do Models Continue Misaligned Actions?” by Jordan Taylor

More shows like LessWrong (30+ Karma)

The Daily

Astral Codex Ten Podcast

Interesting Times with Ross Douthat

Dwarkesh Podcast

The Ezra Klein Show

AI Article Readings

Doom Debates!

LessWrong posts by zvi