
Summary
Introduction
During our work on linear probing for evaluation awareness, we considered using features relevant to evaluation awareness to uncover sandbagging. Previous work demonstrated that adding noise to weights or activations can recover capabilities from a prompted sandbagging model, and we thought it would be interesting [...]
---
Outline:
(00:12) Summary
(01:04) Introduction
(01:47) Methodology
(06:11) Results
(07:57) Limitations
(08:16) Discussion
(09:41) Acknowledgements
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.