
Summary
Introduction
During our work on linear probing for evaluation awareness, we considered using evaluation-awareness-relevant features to uncover sandbagging. Previous work demonstrated that adding noise to weights or activations can recover capabilities from a prompted sandbagging model, and we thought it would be interesting [...]
---
Outline:
(00:12) Summary
(01:04) Introduction
(01:47) Methodology
(06:11) Results
(07:57) Limitations
(08:16) Discussion
(09:41) Acknowledgements
---
First published:
Source:
Narrated by TYPE III AUDIO.