LessWrong (30+ Karma)

“[Paper] Stress-testing capability elicitation by training password-locked models” by Fabien Roger, ryan_greenblatt


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post.

The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David.

Scheming models might deliberately perform below their actual capabilities (sandbag). They might do this to make evaluations underestimate danger, to slow down promising safety work, or to cause issues with debate or recursive oversight more generally. You might hope to avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail, either because there are too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions.

We’re uncertain how this goes in practice. It could be [...]

---

Outline:

(03:42) The big picture: why good capability elicitation matters for AI safety

(04:25) Good elicitation means good estimation of dangerous capabilities

(06:46) Good elicitation means you can use AIs to do useful (safety) work

(07:56) Aren’t capabilities people already working on capability elicitation?

(08:42) How AIs may intentionally perform poorly aka sandbag

(12:36) Summary of our results

(15:18) Is capability elicitation actually easy when models know how to perform well?

(15:44) Elicitation on password-locked models may be more sample efficient than schemers

(17:26) Password-locked models don’t capture generalization from training

(19:25) Related work

(21:17) Future work on sandbagging we are excited about

(22:56) Appendix: Do our results apply for models performing poorly for non-scheming reasons?

(23:03) AI lab sandbagging

(23:28) Benign elicitation failures

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

June 4th, 2024

Source:

https://www.lesswrong.com/posts/c4sZqhqPwNKGz3fFW/paper-stress-testing-capability-elicitation-by-training

---

Narrated by TYPE III AUDIO.
