
📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism
TL;DR:
---
Outline:
(02:45) Extended Summary
(09:18) Commentary about the paper from Tim:
(09:52) Thoughts on how to steer models to believe they are deployed
(10:06) Contrastive prompts are good actually
(12:10) Steer on multiple layers at once
(13:09) Consider extracting the steering vector from an earlier checkpoint as well
(14:54) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)
(15:24) Creating behavioral demonstrations of misalignment based on the steered model.
(16:46) Other things to do with a model that's been steered to believe it's deployed
(17:31) How others can build on this work
(18:07) Self-Review of the Paper's Strengths and Weaknesses:
(18:19) Main strengths of the paper:
(18:28) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.
(19:28) Steering can trick the model better than simple deployment prompts in our adversarial setting.
(20:47) Most sketchy parts of the paper:
(20:58) Llama Nemotron 49B is gullible and easy to trick.
(23:12) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.
(24:28) Our final model organism struggles to tell deployment from evaluation using subtle clues.
(25:03) We did not explore models who always think they are being evaluated unless a certain deployment cue is present.
(26:13) Bottom Line: Will activation steering work for near-human level misaligned AIs?
(27:26) Other miscellaneous takes on evaluation awareness:
(28:44) Other miscellaneous takes on conducting this type of research
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
By LessWrong
