The Nonlinear Library

AF - Alexander and Yudkowsky on AGI goals by Scott Alexander



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alexander and Yudkowsky on AGI goals, published by Scott Alexander on January 24, 2023 on The AI Alignment Forum.
This is a lightly edited transcript of a chatroom conversation between Scott Alexander and Eliezer Yudkowsky last year, following up on the Late 2021 MIRI Conversations. Questions discussed include "How hard is it to get the right goals into AGI systems?" and "In what contexts do AI systems exhibit 'consequentialism'?".
1. Analogies to human moral development
[Yudkowsky][13:29]
@ScottAlexander ready when you are
[Alexander][13:31]
Okay, how do you want to do this?
[Yudkowsky][13:32]
If you have an agenda of Things To Ask, you can follow it; otherwise I can start by posing a probing question or you can?
We've been very much winging it on these and that has worked... as well as you have seen it working!
[Alexander][13:34]
Okay. I'll post from my agenda. I'm assuming we both have the right to edit logs before releasing them? I have one question where I ask about a specific party where your real answer might offend some people it's bad to offend - if that happens, maybe we just have that discussion and then decide if we want to include it later?
[Yudkowsky][13:34]
Yup, both parties have rights to edit before releasing.
[Alexander][13:34]
Okay.
One story that psychologists tell goes something like this: a child does something socially proscribed (eg steal). Their parents punish them. They learn some combination of "don't steal" and "don't get caught stealing". A few people (eg sociopaths) learn only "don't get caught stealing", but most of the rest of us get at least some genuine aversion to stealing that eventually generalizes into a real sense of ethics. If a sociopath got absolute power, they would probably steal all the time. But there are at least a few people whose ethics would successfully restrain them.
I interpret a major strain in your thought as being that we're going to train fledgling AIs to do things like not steal, and they're going to learn not to get caught stealing by anyone who can punish them. Then, once they're superintelligent and have absolute power, they'll reveal that it was all a lie, and steal whenever they want. Is this worry at the level of "we can't be sure they won't do this"? Or do you think it's overwhelmingly likely? If the latter, what makes you think AIs won't internalize ethical prohibitions, even though most children do? Is it that evolution has given us priors to interpret reward/punishment in a moralistic and internalized way, and entities without those priors will naturally interpret them in a superficial way? Do we understand what those priors "look like"? Is finding out what features of mind design and training data cause internalization vs. superficial compliance a potential avenue for AI alignment?
[Yudkowsky][13:36]
Several layers here! The basic gloss on this is "Yes, everything that you've named goes wrong simultaneously plus several other things. If I'm wrong and one or even three of those things go exactly like they do in neurotypical human children instead, this will not be enough to save us."
If AI is built on anything like the present paradigm, or really on future paradigms either, you can't map that onto the complicated particular mechanisms that get invoked by raising a human child and expect the same result.
[Alexander][13:37]
(give me some sign when you're done answering)
[Yudkowsky][13:37]
(it may be a while but you should probably also just interrupt)
especially if I say something that already sounds wrong
[Alexander: ]
the old analogy I gave was that some organisms will develop thicker fur coats if you expose them to cold weather. this doesn't mean the organism is simple and the complicated information about fur coats was mostly in the environment, and that you could expose an organism from a different...