
Sign up to save your podcasts
Or


The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David.
Scheming models might try to perform less capably than they are able to (sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions.
We’re uncertain how this goes in practice. It could be [...]
---
Outline:
(03:42) The big picture: why good capability elicitation matters for AI safety
(04:25) Good elicitation means good estimation of dangerous capabilities
(06:46) Good elicitation means you can use AIs to do useful (safety) work
(07:56) Aren’t capabilities people already working on capability elicitation?
(08:42) How AIs may intentionally perform poorly aka sandbag
(12:36) Summary of our results
(15:18) Is capability elicitation actually easy when models know how to perform well?
(15:44) Elicitation on password-locked models may be more sample efficient than schemers
(17:26) Password-locked models don’t capture generalization from training
(19:25) Related work
(21:17) Future work on sandbagging we are excited about
(22:56) Appendix: Do our results apply for models performing poorly for non-scheming reasons?
(23:03) AI lab sandbagging
(23:28) Benign elicitation failures
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrongThe paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David.
Scheming models might try to perform less capably than they are able to (sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions.
We’re uncertain how this goes in practice. It could be [...]
---
Outline:
(03:42) The big picture: why good capability elicitation matters for AI safety
(04:25) Good elicitation means good estimation of dangerous capabilities
(06:46) Good elicitation means you can use AIs to do useful (safety) work
(07:56) Aren’t capabilities people already working on capability elicitation?
(08:42) How AIs may intentionally perform poorly aka sandbag
(12:36) Summary of our results
(15:18) Is capability elicitation actually easy when models know how to perform well?
(15:44) Elicitation on password-locked models may be more sample efficient than schemers
(17:26) Password-locked models don’t capture generalization from training
(19:25) Related work
(21:17) Future work on sandbagging we are excited about
(22:56) Appendix: Do our results apply for models performing poorly for non-scheming reasons?
(23:03) AI lab sandbagging
(23:28) Benign elicitation failures
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.

112,909 Listeners

130 Listeners

7,215 Listeners

532 Listeners

16,221 Listeners

4 Listeners

14 Listeners

2 Listeners