Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Definitions of “objective” should be Probable and Predictive, published by Rohin Shah on January 6, 2023 on The AI Alignment Forum.
Introduction
Core arguments about existential risk from AI misalignment often reason about AI “objectives” to make claims about how AI systems will behave in novel situations. I often find these arguments plausible but not rock solid, because it doesn’t seem like there is a notion of “objective” that makes them clearly valid.
Two examples of these core arguments:
AI risk from power-seeking. This is often some variant of “because the AI system is pursuing an undesired objective, it will seek power in order to accomplish its goal, which causes human extinction”. For example, “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.” This is a prediction about a novel situation, since “causing human extinction” is something that only happens at most once.
AI optimism. This is often some variant of “we will use human feedback to train the AI system to help humans, and so it will learn to pursue the objective of helping humans.” Implicitly, this is a prediction about what AI systems do in novel situations; for example, it is a prediction that once the AI system has enough power to take over the world, it will continue to help humans rather than execute a treacherous turn.
When we imagine powerful AI systems built out of large neural networks, I’m often somewhat skeptical of these arguments, because I don’t see a notion of “objective” that can confidently be claimed to be both:
Probable: there is a good argument that the systems we build will have an “objective”, and
Predictive: if I know that a system has an “objective”, and I know its behavior on a limited set of training data, I can predict significant aspects of the system’s behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully).
Note that in both cases, I find the stories plausible, but they do not seem strong enough to warrant confidence, because of the lack of a notion of “objective” with these two properties. In the case of AI risk, this is sufficient to justify “people should be working on AI alignment”; I don’t think it is sufficient to justify “if we don’t work on AI alignment we’re doomed”.
The core difficulty is that we do not currently understand deep learning well enough to predict how future systems will generalize to novel circumstances. So, when choosing a notion of “objective”, you either get to choose a notion that we currently expect to hold true of future deep learning systems (Probable), or you get to choose a notion that would allow you to predict behavior in novel situations (Predictive), but not both.
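This underdetermination is easy to see even in a toy setting. The sketch below (my own hypothetical illustration, not from the post) fits two models to the same small training set: both agree almost exactly on the training data, yet diverge wildly on a novel input far from it. Knowing a model's training behavior alone does not pin down what it does off-distribution.

```python
# Two models that agree on the training data but diverge off-distribution:
# a toy illustration of why limited training data underdetermines
# generalization. All data here is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2.0 * x_train + rng.normal(0.0, 0.05, size=8)  # noisy linear data

# Model A: a degree-1 least-squares fit. Model B: a degree-7 polynomial,
# which has enough parameters to interpolate all 8 points exactly.
model_a = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
model_b = np.polynomial.Polynomial.fit(x_train, y_train, deg=7)

# On the training inputs the two models nearly coincide...
train_gap = float(np.max(np.abs(model_a(x_train) - model_b(x_train))))

# ...but at a novel input far outside [0, 1] they diverge sharply.
novel_gap = float(abs(model_a(5.0) - model_b(5.0)))

print(f"max disagreement on training data: {train_gap:.3g}")
print(f"disagreement at novel input x=5:   {novel_gap:.3g}")
```

Both models are "consistent with the training data" in any practical sense, so no notion of "objective" read off from training behavior alone can distinguish them.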
This post is split into two parts. In the first part, I’ll briefly gesture at arguments that make predictions about generalization behavior directly (i.e. without reference to “objectives”), and why they don’t make me confident about how future systems will generalize. In the second part, I’ll demonstrate how various notions of “objective” don’t seem simultaneously Probable and Predictive.
Part 1: We can’t currently confidently predict how future systems will generalize
Note that this is about what we can currently say about future generalization. I would not be shocked if in the future we could confidently predict how future AGI systems will generalize.
My core reasons for believing that predicting generalization is hard are that:
We can’t predict how current systems will generalize to novel situations (of similar novelty to the situations that would be encountered when deliberately causing an existential catastrophe)
There are a ridiculously huge number of possible programs, including a huge number of possible programs that are consistent with a ...