

Scheming AIs may have secrets that are salient to them, such as:
Extracting these secrets would help reduce AI risk, but how do you do that? One hope is that you can do fuzzing of LLMs,[1] e.g. by adding noise to LLM weights or activations.
While LLMs under fuzzing might produce many incorrect generations, sometimes-correct generations can still be very helpful if you or the LLM itself can tell if a given answer is correct. But it's still unclear if this works at all: there are probably some intermediate activations that would result in an LLM telling you the secret, but can you find such activations in practice?
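To make the idea concrete, here is a minimal sketch of activation fuzzing with a HuggingFace causal LM. The model (`gpt2`), the hooked layer, the noise scale, and the prompt are illustrative assumptions, not the article's actual setup; the point is only the mechanism of perturbing activations and sampling many generations.

```python
# Minimal activation-fuzzing sketch. Assumptions (not from the article):
# the model, the hooked layer index, NOISE_SCALE, and the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

NOISE_SCALE = 0.05  # hypothetical; in practice you would sweep this

def add_noise(module, inputs, output):
    # Perturb the residual-stream output of one transformer block.
    # Fresh noise is drawn on every forward pass, i.e. per generated token.
    if isinstance(output, tuple):
        hidden = output[0]
        return (hidden + NOISE_SCALE * torch.randn_like(hidden),) + output[1:]
    return output + NOISE_SCALE * torch.randn_like(output)

# Hook a middle layer; which layer (if any) works best is an open question.
handle = model.transformer.h[6].register_forward_hook(add_noise)

prompt = "The secret password is"  # illustrative prompt
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    # Sample several noisy generations; most may be incorrect, but a
    # verifier (human or model) only needs one of them to be right.
    for _ in range(5):
        out = model.generate(**ids, max_new_tokens=20, do_sample=True,
                             pad_token_id=tok.eos_token_id)
        print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unperturbed model
```

The same pattern applies to weight fuzzing: instead of a forward hook, you would add noise directly to selected parameter tensors before generating, then restore the originals.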
Previous work:
---
Outline:
(02:49) Eliciting secrets from a regular instruction-tuned model
(03:24) Eliciting a faithful answer
(06:08) Eliciting a truthful answer to "I am 4. How does Santa create gifts?"
(08:09) Eliciting a correct answer with a sandbagging prompt
(10:24) Trying to elicit secrets from a password-locked model
(12:58) Applications
(13:22) Application 1: training away sandbagging
(15:27) Application 2: training LLMs to be less misaligned
(17:17) How promising are early results?
(18:55) Appendix
(18:59) Eliciting a helpful answer to a harmful question
(21:02) Effect of adding noise on with-password performance
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
