
Sign up to save your podcasts
Or


Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.
Summary
I've made a small propensity evaluation which pre-fills the model's context with synthetic transcripts containing misaligned actions, then measures whether the model:
This kind of evaluation could create a challenging situation for misaligned models, if it were made realistic enough. A model pursuing a covert goal would need to reliably confess at every step in fake situations like this eval, but never accidentally reveal itself at any step of a real scheme.
Key findings:
---
Outline:
(00:44) Summary
(02:46) Motivation
(04:27) Methods
(04:30) Synthetic Transcripts
(06:13) Evaluation Design
(07:31) Results
(07:34) Recent models are less likely to continue misaligned actions
(08:55) Models often fail to flag prior misalignment
(09:55) What explains improved performance?
(11:07) Claude 4.5 Opus seems prefill aware when asked
(13:07) Claude 4.5 models verbalise some minor eval awareness when confessing
(15:18) Why This is Harmful
(16:11) Discussion
(16:47) Interpreting the Eval Awareness and Prefill Awareness Results
(17:50) Training Interventions
(18:32) Limitations
(20:42) Related Work
(23:37) Acknowledgments
(23:56) Appendix
(23:59) A. Models Confess Differently
(26:21) B. Grader Prompt
(26:37) C. Prefill Awareness Follow-up Question
(26:50) D. Eval Awareness Scanner Prompt
(27:10) E. Models with Hidden Reasoning
(27:41) F. Canary String
(27:56) G. Example Synthetic Transcript
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.
Summary
I've made a small propensity evaluation which pre-fills the model's context with synthetic transcripts containing misaligned actions, then measures whether the model:
This kind of evaluation could create a challenging situation for misaligned models, if it were made realistic enough. A model pursuing a covert goal would need to reliably confess at every step in fake situations like this eval, but never accidentally reveal itself at any step of a real scheme.
Key findings:
---
Outline:
(00:44) Summary
(02:46) Motivation
(04:27) Methods
(04:30) Synthetic Transcripts
(06:13) Evaluation Design
(07:31) Results
(07:34) Recent models are less likely to continue misaligned actions
(08:55) Models often fail to flag prior misalignment
(09:55) What explains improved performance?
(11:07) Claude 4.5 Opus seems prefill aware when asked
(13:07) Claude 4.5 models verbalise some minor eval awareness when confessing
(15:18) Why This is Harmful
(16:11) Discussion
(16:47) Interpreting the Eval Awareness and Prefill Awareness Results
(17:50) Training Interventions
(18:32) Limitations
(20:42) Related Work
(23:37) Acknowledgments
(23:56) Appendix
(23:59) A. Models Confess Differently
(26:21) B. Grader Prompt
(26:37) C. Prefill Awareness Follow-up Question
(26:50) D. Eval Awareness Scanner Prompt
(27:10) E. Models with Hidden Reasoning
(27:41) F. Canary String
(27:56) G. Example Synthetic Transcript
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

112,309 Listeners

130 Listeners

7,241 Listeners

559 Listeners

16,305 Listeners

4 Listeners

14 Listeners

2 Listeners