Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Complexity classes for alignment properties, published by Arun Jose on February 20, 2024 on The AI Alignment Forum.
I don't think this idea is particularly novel, but it comes up often in conversations I have, so I figured it'd be good to write it up.
How do you prevent deception from AI systems?
One obvious thing to try would be to make sure that your model never thinks deceptive thoughts. There are several ways you could go about this, to varying degrees of effectiveness. For instance, you could check for various
precursors to deceptive alignment. More prosaically, you could try identifying deceptive circuits within your model, or looking at what features your model is keeping track of and identifying any suspicious ones.
I think these are pretty reasonable approaches. However, I think they fail to address failure modes like
deep deceptiveness. A system can be deceptive even if no individual part looks deceptive, due to complex interactions between the system and the environment. More generally, cognition and optimization power can be externalized to the environment.
One could make the argument that focusing on more salient and dangerous failure modes like
deceptive alignment makes a lot more sense. However - especially if you're interested in approaches that work in the worst case and don't rely on reality turning out optimistically one way or the other - you probably want approaches that prevent things from going badly at all.
So, how do you prevent any failure modes that route through deception? In
an earlier post, I wrote about robust intent alignment as the solution, and one research direction I think is feasible to get there. But here I want to make a different point, about what it would look like to interface with deception in that setup.
Start from the intuition that deception in a system is a property of the person being deceived more than it is the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system that is the agent and its environment. So, if you wanted to interface with the general thing that is deception, you're going to be building an interface to that composite system.
Another way of putting this is that evaluating whether an action leads to deception or not requires being able to evaluate things within the complexity class of the environment. Which brings us back to the classical difficulty of RLHF: we can't foresee the consequences of some action very reliably in a complex environment. This means we're unable to properly evaluate whether an action taken by something smarter than us is good or not.
If all alignment properties we cared about were in this complexity class, it may be the case that worst-case alignment is strictly intractable.
However, there are many properties that belong to the complexity class of the agent alone, such that evaluation is much more feasible. Properties describing specific mechanistic components of the model, for instance. The sub-problem of identifying explicitly deceptive circuits within a model falls under this category[1].
Another property is related to internal optimization within systems. If you buy the idea of optimization as pointing at something real in the world[2], then for the right operationalization properties about optimization are properties that belong to the complexity class of the system it's internal to. One example of this is the target of this optimization: what I refer to as
objectives, and the focus of most of my work.
I think one way to frame the core part of the technical alignment problem is that we need to figure out how to interface with the alignment properties we care about within the systems we care about. The choice of what properties we try to build interfaces with is pretty central to that, and I think a commonly missed idea is t...