Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The self-unalignment problem, published by Jan Kulveit on April 14, 2023 on The AI Alignment Forum.
The usual basic framing of alignment looks something like this:
We have a system "A" which we are trying to align with system "H", which should establish some alignment relation "f" between the systems. Generally, as a result, the aligned system A should do "what the system H wants". Two things stand out in this basic framing:
Alignment is a relation, not a property of a single system. So the nature of system H affects what alignment will mean in practice.
It’s not clear what the arrow (the alignment relation f) should mean.
There are multiple explicit proposals for this, e.g., some versions of corrigibility, constantly trying to cooperatively learn preferences, more naive approaches like plain IRL, and some empirical approaches to aligning LLMs.
Even when researchers don’t make an explicit proposal for what the arrow means, their alignment work still rests on some implicit understanding of what the arrow signifies.
But humans are self-unaligned
To my mind, existing alignment proposals usually neglect an important feature of the system "H": the system "H" is not self-aligned, under whatever meaning of alignment is implied by the alignment proposal in question.
Technically, taking alignment as a relation, and taking the various proposals as implicitly defining what it means to be ‘aligned’, the question is whether the relation is reflexive.
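To make the reflexivity question precise (this formalization is mine, not from the original post): treat alignment as a binary relation f over systems, so "A is aligned with H" means the pair (A, H) is in f. The relation f is reflexive on a class of systems exactly when (S, S) is in f for every system S in that class, and "H is self-aligned" is just the particular instance (H, H) ∈ f.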
Sometimes, a shell game seems to be happening with the difficulties of humans lacking self-alignment, e.g., assuming that if the AI is aligned, it will surely know how to deal with internal conflicts in humans.
While what I'm interested in is the abstract problem, best understood at the level of properties of the alignment relation, it may be useful to illustrate it with a toy model.
In the toy model, we will assume a specific structure of system "H":
A set of parts p1..pn, with different goals or motivations or preferences. Sometimes, these parts might be usefully represented as agents; other times not.
A shared world model.
An aggregation mechanism Σ, translating what the parts want into actions, in accordance with the given world model.
In this framing, it’s not entirely clear what the natural-language pointer ‘what system H wants’ translates to. Some compelling options (illustrated in the code sketch after this list) are:
The output of the aggregation procedure.
What the individual parts want.
The output of a Pareto-optimal aggregation procedure.
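As a concrete (and entirely hypothetical) illustration of how these options can come apart, here is a minimal Python sketch of the toy model; the parts, their numeric preferences, and the two aggregation mechanisms are my own illustrative assumptions, not anything from the original post.

```python
# A toy model of system "H": parts with conflicting preferences over actions,
# plus two different aggregation mechanisms standing in for the aggregator Σ.
# All names and numbers here are illustrative assumptions.

from typing import Dict, List

Action = str
Preferences = Dict[Action, float]  # utility a part assigns to each action

# Three parts p1..p3 with partially conflicting preferences (toy numbers).
parts: List[Preferences] = [
    {"work": 1.0, "rest": 0.0},  # p1 strongly prefers working
    {"work": 0.4, "rest": 0.6},  # p2 mildly prefers resting
    {"work": 0.4, "rest": 0.6},  # p3 mildly prefers resting
]

def aggregate_by_vote(parts: List[Preferences]) -> Action:
    """One possible Σ: each part votes for its top action; majority wins."""
    votes: Dict[Action, int] = {}
    for p in parts:
        top = max(p, key=p.get)
        votes[top] = votes.get(top, 0) + 1
    return max(votes, key=votes.get)

def aggregate_by_total_utility(parts: List[Preferences]) -> Action:
    """Another possible Σ: pick the action maximizing summed utility."""
    actions = parts[0].keys()
    return max(actions, key=lambda a: sum(p[a] for p in parts))

# "What the individual parts want" already gives three (partially conflicting)
# answers, and the two aggregation procedures disagree with each other too:
print([max(p, key=p.get) for p in parts])   # ['work', 'rest', 'rest']
print(aggregate_by_vote(parts))             # 'rest' (2 of 3 parts prefer it)
print(aggregate_by_total_utility(parts))    # 'work' (1.8 total vs. 1.2)
```

In this toy setting, whichever of these outputs one picks as ‘what H wants’, the other candidate answers disagree with it.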
For any operationalization of what alignment means, we can ask if system H would be considered ‘self-aligned’, that is, if the alignment relation would be reflexive. For most existing operationalizations, it’s either unclear if system H is self-aligned, or clear that it isn’t.
In my view, this often puts the whole proposed alignment structure on quite shaky grounds.
Current approaches mostly fail to explicitly deal with self-unalignment
It’s not that alignment researchers believe that humans are entirely monolithic and coherent. I expect most alignment researchers would agree that humans are in fact very messy.
But in practice, a lot of alignment researchers seem to assume that it’s fine to abstract this away. There seems to be an assumption that alignment (the correct operationalization of the arrow f) doesn’t depend much on the contents of the system H box. So if we abstract the contents of the box away and figure out how to deal with alignment in general, this will naturally and straightforwardly extend to the messier case too.
I think this is incorrect. To me, it seems that:
Current alignment proposals implicitly deal with self-unalignment in very different ways.
Each of these ways poses problems.
Dealing with self-unalignment can’t be postponed or delegated to powerful AIs.
The following is a rough classification of the main implicit solutions to self-u...