LessWrong (30+ Karma)

“[Paper] Stress-testing capability elicitation by training password-locked models” by Fabien Roger, ryan_greenblatt


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post.

The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David.

Scheming models might try to perform less capably than they are able to (sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions.

We’re uncertain how this goes in practice. It could be [...]

---

Outline:

(03:42) The big picture: why good capability elicitation matters for AI safety

(04:25) Good elicitation means good estimation of dangerous capabilities

(06:46) Good elicitation means you can use AIs to do useful (safety) work

(07:56) Aren’t capabilities people already working on capability elicitation?

(08:42) How AIs may intentionally perform poorly aka sandbag

(12:36) Summary of our results

(15:18) Is capability elicitation actually easy when models know how to perform well?

(15:44) Elicitation on password-locked models may be more sample efficient than schemers

(17:26) Password-locked models don’t capture generalization from training

(19:25) Related work

(21:17) Future work on sandbagging we are excited about

(22:56) Appendix: Do our results apply for models performing poorly for non-scheming reasons?

(23:03) AI lab sandbagging

(23:28) Benign elicitation failures

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

June 4th, 2024

Source:

https://www.lesswrong.com/posts/c4sZqhqPwNKGz3fFW/paper-stress-testing-capability-elicitation-by-training

---

Narrated by TYPE III AUDIO.
