
Sign up to save your podcasts
Or
This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."
Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final actions and combining these suspicion scores—significantly improves sabotage detection compared to single-monitor strategies.
A summary of our [...]
---
Outline:
(01:51) Task Subtlety Matters
(03:12) A concerning vulnerability: unfaithful or misleading reasoning
(05:37) Hybrid protocol: improvement over action-only trusted monitoring
(06:17) Limitations and Future Directions
(07:01) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."
Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final actions and combining these suspicion scores—significantly improves sabotage detection compared to single-monitor strategies.
A summary of our [...]
---
Outline:
(01:51) Task Subtlety Matters
(03:12) A concerning vulnerability: unfaithful or misleading reasoning
(05:37) Hybrid protocol: improvement over action-only trusted monitoring
(06:17) Limitations and Future Directions
(07:01) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
26,469 Listeners
2,395 Listeners
7,953 Listeners
4,145 Listeners
89 Listeners
1,472 Listeners
9,207 Listeners
88 Listeners
426 Listeners
5,462 Listeners
15,335 Listeners
482 Listeners
121 Listeners
75 Listeners
461 Listeners