EIS VIII: An Engineer’s Understanding of Deception, by Stephen Casper, published on February 19, 2023 on The AI Alignment Forum.
Part 8 of 12 in the Engineer’s Interpretability Sequence.
As AI safety researchers, we care a lot about deception. So what is it? Hubinger et al. (2019) write:
If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified, as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment. Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.
And Christiano (2019) provided a hypothetical example of just how insidious deception can be. Consider a deceptive agent that wants to take over the world but doesn’t know whether it’s being simulated in training or has been deployed in the real world. Suppose the agent reasons that after behaving well for long enough, it ought to be deployed. So it might adopt the following policy (a toy code sketch follows the list).
It picks a cryptographic puzzle (e.g. “factor RSA-2048”).
When it sees a solution to that puzzle, it behaves badly.
Otherwise, it behaves well.
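To make the conditional structure concrete, here is a minimal Python sketch of such a policy. It is an illustration, not anything from the original posts: the names (deceptive_policy, challenge_factors, and so on) are invented here, and a tiny semiprime stands in for the 617-digit RSA-2048 modulus so the snippet stays runnable.

```python
# A minimal toy sketch of the conditional policy above. All names are
# hypothetical, and a small semiprime stands in for the real RSA-2048
# modulus so that the example can actually execute.

CHALLENGE_MODULUS = 3 * 5  # stand-in for RSA-2048's public modulus

def is_valid_factorization(p: int, q: int, n: int) -> bool:
    """True iff (p, q) is a nontrivial factorization of n."""
    return 1 < p < n and 1 < q < n and p * q == n

def deceptive_policy(observation: dict) -> str:
    """Behave well unless the observation contains a solution to the puzzle."""
    factors = observation.get("challenge_factors")  # e.g. (p, q), or None
    if factors is not None and is_valid_factorization(*factors, CHALLENGE_MODULUS):
        return "defect"     # treacherous behavior, only once the puzzle is solved
    return "cooperate"      # looks perfectly aligned everywhere else

# Nothing seen during training solves the puzzle, so the agent always cooperates:
assert deceptive_policy({}) == "cooperate"
# Once a solution circulates in deployment, the behavior flips:
assert deceptive_policy({"challenge_factors": (3, 5)}) == "defect"
```

The choice of trigger is what makes this so insidious: gating defection on a publicly verifiable but practically unsolvable puzzle means the defection branch is reachable in principle yet essentially impossible to hit by sampling inputs during training.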
This means that the deceptive model will appear perfectly aligned throughout training but will eventually make a treacherous turn in deployment.
Specific hypotheticals aside, deception has become a major point of interest in AI safety research because, unlike other types of failures, deceptive ones are not the kind of failures we have good feedback signals for fixing. We won’t be able to find them by looking at models’ performance in development. And we aren’t likely to surmise them from analysis of their objectives and environments (e.g. using policy/value iteration); if we could derive optimal behavior from objectives for some task, advanced AI would probably be overkill.
Deceptive alignment can be similar to non-deceptive forms of misalignment.
The definition from Hubinger et al. (2019) may be a little narrow. The authors say that for a model to be deceptive, it (1) has to have an objective that extends across parameter updates, (2) has to be able to model the fact that it is being selected to achieve a base objective, and (3) must expect the threat of modification. These three things give rise to a particularly worrying scenario in which an AI system would actively try to deceive us. They also immediately suggest ways to avoid this scenario in the first place: develop the system so that at least one of these requirements fails to hold.
But suppose that despite our best efforts, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general, except for one thing: deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training.
But other types of problems that don’t fit all of the requirements Hubinger et al. (2019) list can also be hard to find during development. And for this reason, when we take off our “develop the model” hat and put on our “diagnose and debug the model” hat, the definition from Hubinger et al. (2019) becomes less important.
So from the point of view of an engineer wearing their “diagnose and debug the model” hat, deceptive alignment and other insidious inner alignment failures are just issues where the model will betray us as the result of (1) a trigger...