
The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David.
Scheming models might try to perform less capably than they are able to (sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions.
We’re uncertain how this goes in practice. It could be [...]
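Below is a minimal sketch of the supervised part of capability elicitation described above: fine-tuning a (possibly sandbagging) model on a handful of high-quality demonstrations before any RL. The model name, demonstration data, and hyperparameters are illustrative assumptions, not the paper's actual setup.

```python
# Sketch: elicit capabilities by fine-tuning on strong demonstrations.
# Assumes Hugging Face Transformers + PyTorch; all specifics are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses different models
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# A few (prompt, strong-answer) demonstrations; in practice these might come
# from humans or from a weaker trusted model, and may be scarce.
demos = [
    ("Q: What is 17 * 24?\nA:", " 408"),
    ("Q: Factor x^2 - 5x + 6.\nA:", " (x - 2)(x - 3)"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    for prompt, answer in demos:
        batch = tokenizer(prompt + answer, return_tensors="pt")
        # Standard causal-LM loss over the full sequence; a more careful
        # implementation would mask the prompt tokens out of the loss.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Whether this kind of fine-tuning actually recovers hidden capability depends on sample efficiency with few (or zero) demonstrations, which is exactly what the paper investigates; an RL stage would typically follow this supervised step.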
---
Outline:
(03:42) The big picture: why good capability elicitation matters for AI safety
(04:25) Good elicitation means good estimation of dangerous capabilities
(06:46) Good elicitation means you can use AIs to do useful (safety) work
(07:56) Aren’t capabilities people already working on capability elicitation?
(08:42) How AIs may intentionally perform poorly aka sandbag
(12:36) Summary of our results
(15:18) Is capability elicitation actually easy when models know how to perform well?
(15:44) Elicitation on password-locked models may be more sample efficient than schemers
(17:26) Password-locked models don’t capture generalization from training
(19:25) Related work
(21:17) Future work on sandbagging we are excited about
(22:56) Appendix: Do our results apply for models performing poorly for non-scheming reasons?
(23:03) AI lab sandbagging
(23:28) Benign elicitation failures
The original text contained 5 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.