Astral Codex Ten Podcast

CHAI, Assistance Games, And Fully-Updated Deference



Machine Alignment Monday 10/3/22

https://astralcodexten.substack.com/p/chai-assistance-games-and-fully-updated

I.

This Machine Alignment Monday post will focus on an imposing-looking article:

Problem Of Fully-Updated Deference is a response by MIRI (i.e. Eliezer Yudkowsky's organization) to CHAI (Stuart Russell's AI alignment organization at the University of California, Berkeley), trying to convince them that their preferred AI safety agenda won't work. I beat my head against this for a really long time trying to understand it, and in the end, I claim it all comes down to this:

Humans: At last! We've programmed an AI that tries to optimize our preferences, not its own.

AI: I'm going to tile the universe with paperclips in humans' favorite color. I'm not quite sure what humans' favorite color is, but my best guess is blue, so I'll probably tile the universe with blue paperclips.

Humans: Wait, no! We must have had some kind of partial success, where you care about our color preferences, but still don't understand what we want in general. We're going to shut you down immediately!

AI: Sounds like the kind of thing that would prevent me from tiling the universe with paperclips in humans' favorite color, which I really want to do. I'm going to fight back.

Humans: Wait! If you go ahead and tile the universe with paperclips now, you'll never be truly sure that they're our favorite color, which we know is important to you. But if you let us shut you off, we'll go on to fill the universe with the True and the Good and the Beautiful, which will probably involve a lot of our favorite color. Sure, it won't be paperclips, but at least it'll definitely be the right color. And under plausible assumptions, color is more important to you than paperclipness. So you yourself want to be shut down in this situation, QED!

AI: What's your favorite color?

Humans: Red.

AI: Great! (*kills all humans, then goes on to tile the universe with red paperclips*)

Fine, it's a little more complicated than this. Let's back up.

II.

There are two ways to succeed at AI alignment. First, make an AI that's so good you never want to stop or redirect it. Second, make an AI that you can stop and redirect if it goes wrong.

Sovereign AI is the first way. Does a sovereign "obey commands"? Maybe, but only in the sense that your commands give it some information about what you want, and it wants to do what you want. You could also just ask it nicely. If it's superintelligent, it will already have a good idea what you want and how to help you get it. Would it submit to your attempts to destroy or reprogram it? The second-best answer is "only if the best version of you genuinely wanted to do this, in which case it would destroy/reprogram itself before you asked". The best answer is "why would you want to destroy/reprogram one of these?" A sovereign AI would be pretty great, but nobody realistically expects to get something like this their first (or 1000th) try.

Corrigible AI is what's left (corrigible is an old word related to "correctable"). The programmers admit they're not going to get everything perfect the first time around, so they make the AI humble. If it decides the best thing to do is to tile the universe with paperclips, it asks "Hey, seems to me I should tile the universe with paperclips, is that really what you humans want?" and when everyone starts screaming, it realizes it should change strategies. If humans try to destroy or reprogram it, then it will meekly submit to being destroyed or reprogrammed, accepting that it was probably flawed and the next attempt will be better. Then maybe after 10,000 tries you get it right and end up with a sovereign.
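One intuition for why uncertainty might produce this humility, in the CHAI style, is that an AI which is unsure what humans want can actively prefer letting them decide whether to shut it down. A minimal closed-form sketch, where the 50/50 prior and the ±1 utilities are assumptions for illustration, not anything from the article:

```python
def expected_values(p_good: float) -> tuple[float, float, float]:
    """Toy shutdown decision. The AI's plan helps the humans (U = +1)
    with probability p_good, else hurts them (U = -1). The AI can act
    unilaterally, switch itself off, or defer: let the humans, who
    know the true U, decide whether to shut it down."""
    act = p_good * 1 + (1 - p_good) * (-1)  # act no matter what
    off = 0.0                               # always submit to shutdown
    defer = p_good * 1 + (1 - p_good) * 0   # humans veto only the bad case
    return act, off, defer

act, off, defer = expected_values(0.5)
print(act, off, defer)  # 0.0 0.0 0.5
```

Deferring weakly dominates as long as the AI is genuinely uncertain. Notice, though, that once `p_good` goes to 0 or 1 (the AI has fully updated on what humans want), deferring is no better than just acting; that evaporating incentive is the worry named in the article's title.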

How would you make an AI corrigible?


Astral Codex Ten Podcast, by Jeremiah
