The Nonlinear Library

AF - Plan for mediocre alignment of brain-like [model-based RL] AGI by Steve Byrnes


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Plan for mediocre alignment of brain-like [model-based RL] AGI, published by Steve Byrnes on March 13, 2023 on The AI Alignment Forum.
(This post is a simpler, self-contained, and more pedagogical version of Post #14 of Intro to Brain-Like AGI Safety.)
(Vaguely related to this Alex Turner post and this John Wentworth post.)
I would like to have a technical plan for which there is a strong robust reason to believe that we’ll get an aligned AGI and a good future. This post is not such a plan.
However, I also don’t have a strong reason to believe that this plan wouldn’t work. Really, I want to throw up my hands and say “I don’t know whether this would lead to a good future or not”. By “good future” here I don’t mean optimally-good—whatever that means—but just “much better than the world today, and certainly much better than a universe full of paperclips”. I currently have no plan, not even a vague plan, with any prayer of getting to an optimally-good future. That would be a much narrower target to hit.
Even so, that makes me more optimistic than at least some people. Or at least, more optimistic about this specific part of the story. In general I think many things can go wrong as we transition to the post-AGI world—see discussion by Dai & Soares—and overall I feel very doom-y, particularly for reasons here.
This plan is specific to the possible future scenario (a.k.a. “threat model” if you’re a doomer like me) that future AI researchers will develop “brain-like AGI”, i.e. learning algorithms that are similar to the brain’s within-lifetime learning algorithms. (I am not talking about evolution-as-a-learning-algorithm.) These algorithms, I claim, are in the general category of model-based reinforcement learning. Model-based RL is a big and heterogeneous category, but I suspect that for any kind of model-based RL AGI, this plan would be at least somewhat applicable. For very different technological paths to AGI, this post is probably pretty irrelevant.
But anyway, if someone published an algorithm for x-risk-capable brain-like AGI tomorrow, and we urgently needed to do something, this blog post is more-or-less what I would propose to try. It’s the least-bad plan that I currently know.
So I figure it’s worth writing up this plan in a more approachable and self-contained format.
1. Intuition: Making a human into a moon-lover (“selenophile”)
Try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.”
You stand there with your mouth agape, muttering to yourself in hushed tones: “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× more so. Maybe you’re on some psychedelic at the time, or this is happening during your impressionable teenage years, or whatever.) You basically transform into a “moon fanboy” / “moon fangirl” / “moon nerd” / “selenophile”.
How would that change your motivations and behaviors going forward?
You’re probably going to be much more enthusiastic about anything associated with the moon.
You’re probably going to spend a lot more time gazing at the moon when it’s in the sky.
If there are moon-themed trading cards, maybe you would collect them.
If NASA is taking volunteers to train as astronauts for a trip to the moon, maybe you’d enthusiastically sign up.
If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that, and motivated to stop them.
Hopefully this is all intuitive so far.
What’s happening mechanistically in your brain? As background, I think we should say that one part of your brain (the cortex, more-or-less) has “thoughts”, and another part of your brain (the basal ganglia, more-or-less) assigns a “value” (in RL ter...