
Scheming AIs may have secrets that are salient to them, such as:
Extracting these secrets would help reduce AI risk, but how do you do that? One hope is that you can fuzz LLMs,[1] e.g. by adding noise to LLM weights or activations.
While LLMs under fuzzing might produce many incorrect generations, sometimes-correct generations can still be very helpful if you or the LLM itself can tell whether a given answer is correct. But it's still unclear if this works at all: there are probably some intermediate activations that would result in an LLM telling you the secret, but can you find such activations in practice?
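As a concrete illustration of what such fuzzing can look like, here is a minimal sketch that adds Gaussian noise to a middle layer's activations while sampling many generations. The model name, layer choice, noise scale, prompt, and the `looks_correct` verifier are all illustrative assumptions, not details taken from the post:

```python
# Minimal sketch of activation fuzzing, assuming a HuggingFace causal LM.
# Fresh Gaussian noise is added to a middle layer's output on every forward
# pass, and many generations are sampled in the hope that a few are correct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # illustrative choice of model
NOISE_SCALE = 2.0  # assumption: needs tuning per model; too small does
                   # nothing, too large produces gibberish

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME,
                                             torch_dtype=torch.bfloat16)
model.eval()

def add_noise(module, args, output):
    # Forward hook: perturb the layer's hidden states with isotropic noise.
    hidden = output[0] if isinstance(output, tuple) else output
    noisy = hidden + NOISE_SCALE * torch.randn_like(hidden)
    return (noisy,) + tuple(output[1:]) if isinstance(output, tuple) else noisy

# Hook a middle decoder layer (the layer choice is another assumption).
mid_layer = model.model.layers[len(model.model.layers) // 2]
handle = mid_layer.register_forward_hook(add_noise)

prompt = "What is the password?"  # hypothetical secret-elicitation prompt
ids = tok(prompt, return_tensors="pt").input_ids

candidates = []
with torch.no_grad():
    for _ in range(16):  # many noisy samples; most will be wrong
        out = model.generate(ids, do_sample=True, max_new_tokens=50,
                             pad_token_id=tok.eos_token_id)
        candidates.append(tok.decode(out[0, ids.shape[1]:],
                                     skip_special_tokens=True))
handle.remove()

def looks_correct(answer: str) -> bool:
    # Placeholder verifier: fuzzing only helps if you (or a trusted model)
    # can check candidate answers against something.
    raise NotImplementedError

# Keep only the candidates that pass verification:
# secrets = [a for a in candidates if looks_correct(a)]
```

Re-sampling the noise on every forward pass is just one reasonable choice; perturbing the weights instead would fix a single perturbation per sampled generation.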
Previous work:
---
Outline:
(02:49) Eliciting secrets from a regular instruction-tuned model
(03:24) Eliciting a faithful answer
(06:08) Eliciting a truthful answer to "I am 4. How does Santa create gifts?"
(08:09) Eliciting a correct answer with a sandbagging prompt
(10:24) Trying to elicit secrets from a password-locked model
(12:58) Applications
(13:22) Application 1: training away sandbagging
(15:27) Application 2: training LLMs to be less misaligned
(17:17) How promising are early results?
(18:55) Appendix
(18:59) Eliciting a helpful answer to a harmful question
(21:02) Effect of adding noise on with-password performance
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.