Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Criticism of the main framework in AI alignment, published by Michele Campolo on January 31, 2023 on The AI Alignment Forum.
Originally posted on the EA Forum for the Criticism and Red Teaming Contest. Will be included in a sequence containing some previous posts and other posts I'll publish this year.
0. Summary
AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not that well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative which aims at moral progress instead of direct risk reduction, and I reply to some objections. In section 3 I note that technical details about the alternative will come at some point in the future.
The appendix clarifies some minor ambiguities with terminology and links to other stuff.
1. Criticism of the main framework in AI alignment
1.1 What I mean by main framework
In short, it’s the rationale behind most work in AI alignment: solving the control problem to reduce existential risk. I am not talking about AI governance, nor about AI safety that has nothing to do with existential risk (e.g. safety of self-driving cars).
Here are the details, presented as a step-by-step argument.
1. At some point in the future, we'll be able to design AIs that are very good at achieving their goals. (Capabilities premise)
2. These AIs might have goals that are different from their designers' goals. (Misalignment premise)
3. Therefore, very bad futures caused by out-of-control misaligned AI are possible. (From 1 and 2)
4. AI alignment research that is motivated by the previous argument often aims at making misalignment between AI and designer, or loss of control, less likely to happen or less severe. (Alignment research premise)
Common approaches are ensuring that the goals of the AI are well specified and aligned with what the designer originally wanted, or making the AI learn our values by observing our behaviour. In case you are new to these ideas, two accessible books on the subject are [1,2].
5. Therefore, AI alignment research improves the expected value of futures caused by out-of-control misaligned AI. (From 3 and 4)
By expected value I mean a measure of value that takes the likelihood of events into account and follows some intuitive rules, such as "5% chance of extinction is worse than 1% chance of extinction". It need not be an explicit calculation, especially because it might be difficult to compare possible futures quantitatively, e.g. extinction vs dystopia.
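To make the "5% vs 1%" intuition concrete, here is a deliberately simplified sketch of such a comparison. The probabilities, values, and scale are illustrative assumptions of mine, not estimates from the post, which stresses that no explicit calculation is required:

```python
# Toy expected-value comparison over possible futures.
# All numbers here are illustrative assumptions, not real estimates.

def expected_value(outcomes):
    """Sum of probability-weighted values over mutually exclusive outcomes."""
    return sum(p * v for p, v in outcomes)

# Each scenario: list of (probability, value) pairs.
# Value 0 = extinction, 100 = an otherwise fine future (arbitrary scale).
world_a = [(0.05, 0), (0.95, 100)]  # 5% chance of extinction
world_b = [(0.01, 0), (0.99, 100)]  # 1% chance of extinction

# Matches the intuitive rule: 5% chance of extinction is worse than 1%.
assert expected_value(world_b) > expected_value(world_a)
```

The point is only that any reasonable measure respecting the intuitive rules will rank these scenarios the same way; the post's argument does not depend on the specific numbers.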
I don't claim that all AI alignment research follows this framework; just that this is what motivates a decent amount (I would guess more than half) of work in AI alignment.
1.2 Response
I call this a response, and not a strict objection, because none of the points or inferences in the previous argument is rejected. Rather, some extra information is taken into account.
6. Bad actors can use powerful controllable AI to bring about very bad futures and/or lock in their values. (Bad actors premise)
For more information about value lock-in, see chapter 4 of What We Owe The Future [3].
7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals. As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)
8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7)
This conclusion will seem more or less relevant depending on your beliefs about its different components.
An example: if you think t...