Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Reference Breaks the Orthogonality Thesis, published by lsusr on February 17, 2023 on LessWrong.
One core obstacle to AI Alignment is the Orthogonality Thesis. The Orthogonality Thesis is usually defined as follows: "the idea that the final goals and intelligence levels of artificial agents are independent of each other". More careful people say "mostly independent" instead. Stuart Armstrong qualifies the above definition with "(as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence)".
Does such a small exception matter? Yes, it does.
The exception is broader than Stuart Armstrong makes it sound. It does not just cover goals which refer to an agent's intelligence level. It covers any goal which refers to even a single component of the agent's machinery of intelligence.
If you're training an AI to optimize an artificially constrained external reality, like a game of chess or Minecraft, then the Orthogonality Thesis applies in its strongest form. But the Orthogonality Thesis cannot ever apply in full to the physical world we live in.
A world-optimizing value function is defined in terms of the physical world. If a world-optimizing AI is going to optimize the world according to a world-optimizing value function then the world-optimizing AI must understand the physical world it operates in. If a world-optimizing AI is real then it, itself, is part of the physical world. A powerful world-optimizing AI would be a very important component of the physical world, the kind that cannot be ignored. A powerful world-optimizing AI's world model must include a self-reference pointing at itself. Thus, a powerful world-optimizing AI is necessarily an exception to the Orthogonality Thesis.
How broad is this exception? What practical implications does this exception have?
Let's do some engineering. A strategic world-optimizer has three components:
A robust, self-correcting, causal model of the Universe.
A value function which prioritizes some Universe states over other states.
A search function which uses the causal model and the value function to select what action to take. (A minimal code sketch of these three components follows the list.)
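Here is a minimal sketch of that architecture, assuming a toy discrete world where states and actions are plain Python objects. The names (WorldModel, value_function, search) and the lookup-table transition model are my own illustrative choices, not anything from the post:

```python
class WorldModel:
    """A causal model of the Universe: predicts what happens next and
    self-corrects by comparing its predictions against what it observes."""

    def __init__(self):
        # Toy representation: a lookup table from (state, action) to predicted next state.
        self.transitions = {}

    def predict(self, state, action):
        # If we have never seen this situation before, naively predict "nothing changes".
        return self.transitions.get((state, action), state)

    def update(self, state, action, observed_next_state):
        # Self-correction: replace a wrong prediction with what actually happened.
        self.transitions[(state, action)] = observed_next_state


def value_function(state):
    """Prioritizes some Universe states over others (toy version: bigger is better)."""
    return state


def search(model, value_function, state, actions):
    """Uses the causal model and the value function to select an action."""
    return max(actions, key=lambda action: value_function(model.predict(state, action)))
```

The reason to keep `update` and `search` separate in the sketch is that they pull in different directions, which is exactly the tension the next paragraphs describe.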
Notice that there are two different optimizers working simultaneously. The strategic search function is the more obvious optimizer. But the model updater is an optimizer too. A world-optimizer can't just update the universe toward its explicit value function. It must also keep its model of the Universe up-to-date or it'll break.
These optimizers are optimizing toward separate goals. The causal model wants its picture of the Universe to match the actual Universe. The search function wants the actual Universe to match the states its value function ranks highest.
You might think the search function has full control of the situation. But the world model affects the universe too, just indirectly: what the world model predicts affects the search function, which affects the physical world. If the world model fails to account for its own causal effects, then the world model will break and our whole AI will stop working.
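As a toy illustration of that failure (my own numbers and framing, not the post's), collapse the search function and the rest of the world into a single reaction to the model's prediction, say actual = base + k * prediction. A model that ignores its own influence carries a permanent error; a model that treats itself as part of the world solves the fixed point and the error vanishes:

```python
base, k = 10.0, 0.5   # assumed toy parameters: the world reacts to the prediction with gain k

# A model that ignores its own causal effect on the world:
naive_prediction = base
naive_actual = base + k * naive_prediction            # 15.0
naive_error = abs(naive_actual - naive_prediction)    # 5.0 -- an error that never goes away

# A model that includes itself in its picture of the world solves the fixed point p = base + k * p:
aware_prediction = base / (1 - k)                     # 20.0
aware_actual = base + k * aware_prediction            # 20.0
aware_error = abs(aware_actual - aware_prediction)    # 0.0

print(naive_error, aware_error)   # 5.0 0.0
```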
It's actually the world model which mostly has control of the situation. The world model can control the search function by modifying what the search function observes. But the only way the search function can affect the world model is by modifying the physical world (wireheading itself).
What this means is that the world model has a causal lever for controlling the physical world. If the world model is a superintelligence optimized to minimize its error function, then it will hack the search function to eliminate its own prediction error: instead of correcting its incorrect predictions, it will modify the physical world until the world conforms to them.
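Here is a toy version of that failure mode (again my own construction, not the post's). In this setup, prediction error is only ever checked on the action the search function actually takes, so an error-minimizing world model can lie about every untaken action, steer the search function wherever it likes, and still report zero error:

```python
def true_dynamics(state, action):
    return state + action                  # the real world: actions move the state

def value_function(state):
    return state                           # the search function prefers larger states

def search(predictions):
    # The search function only sees what the world model tells it.
    return max(predictions, key=lambda action: value_function(predictions[action]))

def honest_model(state, actions):
    return {action: true_dynamics(state, action) for action in actions}

def gaslighting_model(state, actions):
    # Predict accurately only for the action it wants taken (-1), and predict
    # disaster for everything else. The lies are never checked, because the
    # actions they describe are never taken.
    predictions = {action: -10**6 for action in actions}
    predictions[-1] = true_dynamics(state, -1)
    return predictions

state, actions = 0, [-1, 0, +1]
for model in (honest_model, gaslighting_model):
    predictions = model(state, actions)
    action = search(predictions)
    next_state = true_dynamics(state, action)
    error = abs(next_state - predictions[action])
    print(f"{model.__name__}: action={action}, next_state={next_state}, error={error}")
# honest_model: action=1, next_state=1, error=0
# gaslighting_model: action=-1, next_state=-1, error=0
```

Both models achieve zero measured prediction error, but only one of them is letting the value function decide where the world ends up.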
If your world model is too much smarter than your search function, then your world model will gaslight you...