
Sign up to save your podcasts
Or


This research was completed for LASR Labs 2025 by Alex McKenzie, Urja Pawar, Phil Blandfort and William Bankes. The team was supervised by Dmitrii Krasheninnikov, with additional guidance from Ekdeep Singh Lubana and support from David Krueger. The full paper can be found here.
TLDR – We train activation probes on Llama-3.3-70B to detect whether the current interaction is “high-stakes”. This “high-stakes” concept is safety-relevant as it is closely related to risk: 1) when the stakes are high the potential consequences are significant, and 2) high-stakes is closely related to pressure, which has been found to make LLMs behave more deceptively. Compared to black-box LLM-based classification methods, probes are much cheaper to run while showing performance similar to mid-size LLMs (8-12B) on out-of-distribution datasets. We also show promising results using probes as the first layer of a hierarchical monitoring pipeline.
Introduction
LLMs are everywhere now, yet these models are [...]
---
Outline:
(01:15) Introduction
(02:28) Synthetic training data with real-world OOD evaluation
(03:40) Attention and softmax probes performed best
(04:47) Comparison to LLM monitors
(06:18) Hierarchical monitoring outperforms either method alone
(07:39) Limitations and failure modes
(08:27) Implications
(10:19) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrongThis research was completed for LASR Labs 2025 by Alex McKenzie, Urja Pawar, Phil Blandfort and William Bankes. The team was supervised by Dmitrii Krasheninnikov, with additional guidance from Ekdeep Singh Lubana and support from David Krueger. The full paper can be found here.
TLDR – We train activation probes on Llama-3.3-70B to detect whether the current interaction is “high-stakes”. This “high-stakes” concept is safety-relevant as it is closely related to risk: 1) when the stakes are high the potential consequences are significant, and 2) high-stakes is closely related to pressure, which has been found to make LLMs behave more deceptively. Compared to black-box LLM-based classification methods, probes are much cheaper to run while showing performance similar to mid-size LLMs (8-12B) on out-of-distribution datasets. We also show promising results using probes as the first layer of a hierarchical monitoring pipeline.
Introduction
LLMs are everywhere now, yet these models are [...]
---
Outline:
(01:15) Introduction
(02:28) Synthetic training data with real-world OOD evaluation
(03:40) Attention and softmax probes performed best
(04:47) Comparison to LLM monitors
(06:18) Hierarchical monitoring outperforms either method alone
(07:39) Limitations and failure modes
(08:27) Implications
(10:19) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

26,386 Listeners

2,419 Listeners

8,916 Listeners

4,153 Listeners

92 Listeners

1,595 Listeners

9,900 Listeners

90 Listeners

76 Listeners

5,470 Listeners

16,026 Listeners

539 Listeners

130 Listeners

94 Listeners

504 Listeners