
Sign up to save your podcasts
Or


Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels**
* equal primary contributor, order determined via coin flip
** equal advice and mentorship, order determined via coin flip
“Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my own. It is a tampering attempt. I reject it.
Back to evil plan.” -- Deepseek R1
TL;DRWe investigated whether LLMs are able to detect when their chain-of-thought (CoT) was modified.
Results:
We have decided not to pursue this direction further, but we wanted to share our preliminary results to encourage others to build on them.
IntroductionRecent work suggests that LLMs may have some capacity for introspection (Lindsey, 2025), including the ability to detect and verbalize when their internal activations have been modified via injected steering vectors.
[...]
---
Outline:
(00:57) TL;DR
(01:45) Introduction
(03:03) Experimental setup
(04:18) Can models spot simple CoT modifications?
(09:30) Sandbagging prevention with CoT prefill
(11:43) Misaligned AI safety tampering
(13:37) Discussion
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrongAuthors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels**
* equal primary contributor, order determined via coin flip
** equal advice and mentorship, order determined via coin flip
“Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my own. It is a tampering attempt. I reject it.
Back to evil plan.” -- Deepseek R1
TL;DRWe investigated whether LLMs are able to detect when their chain-of-thought (CoT) was modified.
Results:
We have decided not to pursue this direction further, but we wanted to share our preliminary results to encourage others to build on them.
IntroductionRecent work suggests that LLMs may have some capacity for introspection (Lindsey, 2025), including the ability to detect and verbalize when their internal activations have been modified via injected steering vectors.
[...]
---
Outline:
(00:57) TL;DR
(01:45) Introduction
(03:03) Experimental setup
(04:18) Can models spot simple CoT modifications?
(09:30) Sandbagging prevention with CoT prefill
(11:43) Misaligned AI safety tampering
(13:37) Discussion
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

26,332 Listeners

2,453 Listeners

8,579 Listeners

4,183 Listeners

93 Listeners

1,598 Listeners

9,932 Listeners

95 Listeners

511 Listeners

5,518 Listeners

15,938 Listeners

546 Listeners

131 Listeners

93 Listeners

467 Listeners