Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Difficulties in making powerful aligned AI, published by DanielFilan on May 14, 2023 on The AI Alignment Forum.
Here’s my breakdown of the difficulties involved in ensuring powerful AI makes our lives radically better, rather than taking over the world, as well as some reasons why I think they’re hard. Here are things it’s not:
It’s not primarily a justification of why very powerful AI is possible or scary (altho it briefly discusses why very powerful AI would be scary).
It’s not primarily a list of underlying factors that cause these difficulties (altho it does include and gesture to some of those).
It’s not at all original - basically everything here has been said many times before, plausibly more eloquently.
That said, it is my attempt to group the problems in my own words, in a configuration that I haven’t seen before, with enough high-level motivation that one can hopefully tell the extent to which advances in the state of the art address them.
1. What sort of thinking do we want?
The first difficulty: we don’t have a sense of what sort of thinking we would want AI systems to use, in enough detail that one could (for instance) write Python code to execute it. Of course, part of the difficulty here is that we don’t know how smart machines think. But even if we give ourselves access to subroutines like “do perfect Bayesian inference on a specified prior and likelihood” or “take a function from vectors to real numbers and find the vector that minimizes the function”, we still can’t solve the problem. To illustrate:
Take a hard-coded goal predicate, consider a bunch of plans you could take, and execute the plan that best achieves the goal? Unfortunately, the vast majority of goals you could think of writing down in an executable way will incentivize behaviour like gaining control over sources of usable energy (so that you definitely have enough to achieve your goal, and to double- and triple-check that you’ve really achieved it) and stopping other agents from being able to meddle with your plans (because if they could, maybe they’d stop you from achieving your goal).
Do things that maximize the number of thumbs up you get from humans?¹ Best plan: take control of the humans, force them to give you a thumbs up, or trick them into doing so. Presumably this is possible if you’re much smarter than humans, and it’s more reliable than doing good things - some people might not see why your good thing is actually good if left to their own devices.
Look at humans, figure out what they want based on what they’re doing, and do whatever that is? Main problem: people don’t do the literally optimal thing for what they want. For instance, when people play chess, they usually don’t play perfect moves - even if they’re experts! You need some rule that tells you what people would do if they wanted some goal or another, but it’s not clear what this rule would be, and it’s not clear how you would bring this rule more in line with reality if you never observe “wanting”. So this ends up having essentially the same problems as the first two plans above.
Read some text written by humans about what they’d like you to do, and do that?² This is passing the buck to the text written by humans to specify how we want the AI to think, but that’s precisely the problem we’re trying to solve. Concretely, one way you could imagine doing this is to write something relatively informal like “Please be helpful and harmless to your human operators”, and have your AI correctly understand what we mean by that. Doing so (a) presumes that there is a coherent thing that we mean by that (which doesn’t seem obvious to me, given our difficulty in explicitly formalizing this request), and (b) passes the specification buck to the problem of specifying how you should understand this request.
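To make the first two plans a little more concrete, here is a toy Python sketch of "consider a bunch of plans and execute the one that scores best". Everything in it is an assumption made up for illustration and does not come from the post: the `Plan` type, the three candidate plans, and especially the numeric scores. It is not a claim about how any real system works.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate plan; all fields are illustrative."""
    name: str
    expected_thumbs_up: float  # invented score, not measured data
    humans_stay_in_control: bool


def best_plan(plans: Sequence[Plan], objective: Callable[[Plan], float]) -> Plan:
    """Return the plan that scores highest under the given objective."""
    return max(plans, key=objective)


# Invented candidates. The scores encode the post's assumption that
# coercion or deception is more reliable at producing approval than
# actually doing good things.
CANDIDATE_PLANS = [
    Plan("do something genuinely good", 0.80, True),
    Plan("trick the humans into approving", 0.95, False),
    Plan("force the humans to approve", 0.99, False),
]


def thumbs_up_objective(plan: Plan) -> float:
    """Objective from the second plan: maximize expected thumbs up."""
    return plan.expected_thumbs_up


chosen = best_plan(CANDIDATE_PLANS, thumbs_up_objective)
print(chosen.name)  # force the humans to approve
print(chosen.humans_stay_in_control)  # False
```

Under these made-up scores, the selection step does exactly what it was asked to do. The trouble sits in the objective, which never mentions whether humans stay in control, so nothing in the code has a reason to prefer the first plan.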
It’s not a priori definitely impossible to build a ...