Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Categorizing failures as “outer” or “inner” misalignment is often confused, published by Rohin Shah on January 6, 2023 on The AI Alignment Forum.
Pop quiz: Are the following failures examples of outer or inner misalignment? Or is it ambiguous?
(I’ve spoilered the scenarios so you look at each one separately and answer before moving on to the next. Even if you don’t want to actually answer, you should read through the scenarios for the rest of the post to make sense.)
Scenario 1: We train an AI system that writes pull requests based on natural language dialogues where we provide specifications. Human programmers check whether the pull request is useful, and if so, the pull request is merged and the AI is rewarded. During training the humans always correctly evaluate whether the pull request is useful. At deployment, the AI constructs human farms where humans are forced to spend all day merging pull requests that now just add useless print statements to the codebase.
Scenario 2: We train an AI system to write pull requests as before. The humans providing feedback still correctly evaluate whether the pull request is useful, but it turns out that during training the AI system noticed that it was in training, and so specifically wrote useful pull requests in order to trick the humans into thinking that it was helpful, when it was already planning to take over the world and build human farms.
I posed this quiz to a group of 10+ full-time alignment researchers a few months ago, and a majority thought that Scenario 1 was outer misalignment, while almost everyone thought that Scenario 2 was inner misalignment.
But that can’t be correct, since Scenario 2 is just Scenario 1 with more information! If you think that Scenario 2 is clearly inner misalignment, then surely Scenario 1 has to be either ambiguous or inner misalignment! (See the appendix for hypotheses about what’s going on here.)
I claim that most researchers’ intuitive judgments about outer and inner misalignment are confused. So what do we actually mean by “outer” and “inner” misalignment? Is it sensible to talk about separate “outer” and “inner” misalignment problems, or is that just a confused question?
I think it is misguided to categorize failures as “outer” or “inner” misalignment. Instead, I think “outer” and “inner” alignment are best thought of as one particular way of structuring a plan for building aligned AI.
I’ll first discuss some possible ways that you could try to categorize failures, and why I’m not a fan of them, and then discuss outer and inner alignment as parts of a plan for building aligned AI.
Possible categorizations
Objective-based. This categorization is based on distinctions between specifications or objectives at different points in the overall system. This post identifies three different notions of specifications or objectives:
Ideal objective (“wishes”): The hypothetical objective that describes good behavior that the designers have in mind.
Design objective (“blueprint”): The objective that is actually used to build the AI system. This is the designers' attempt to formalize the ideal objective.
Revealed objective (“behavior”): The objective that best describes what actually happens.
On an objective-based categorization, outer misalignment is a discrepancy between the ideal and design objectives, while inner misalignment is a discrepancy between the design and revealed objectives.
(The mesa-optimizers paper has a similar breakdown, except the third objective is a structural objective, rather than a behavioral one. This is also the same as Richard’s Framing 1.)
With this operationalization, it is not clear which of the two categories a given situation falls under, even when you know exactly what happened. In our scenarios above, what exactly is the design objective? Is it “how ...