Ultimate ends may be easily hidable behind convergent subgoals, published by TsviBT on April 2, 2023 on LessWrong.
[Metadata: crossposted from. First completed December 18, 2022.]
Thought and action in pursuit of convergent instrumental subgoals do not automatically reveal why those subgoals are being pursued--towards what supergoals--because many other agents with different supergoals would also pursue those subgoals, maybe with overlapping thought and action. In particular, an agent's ultimate ends don't have to be revealed by its pursuit of convergent subgoals. It might therefore be easy to covertly pursue some ultimate goal by mostly pursuing generally useful subgoals of other supergoals. By the inspection paradox for the convergence of subgoals, it might be easy to think and act almost entirely as a non-threatening agent would think and act, while going most of the way towards achieving some other more ambitious goal.
Note: the summary above is the basic idea. The rest of the essay analyzes the idea in a lot of detail. The final main section might be the most interesting.
What can you tell about an agent's ultimate intent by its behavior?
An agent's ultimate intent is what the agent would do if it had unlimited ability to influence the world. What can we tell about an agent's ultimate intent by watching the external actions it takes, whether low-level (e.g. muscle movements) or higher-level (e.g. going to the store), and by watching its thinking (e.g. which numbers it's multiplying, which questions it's asking, which web searches it's running, which concepts are active)?
Terms
The cosmos is everything, including the observable world, logical facts, observers and agents themselves, and possibilities.
A goal-state is a state of the cosmos that an agent could try to bring about. (A state is equivalent to a set of states; more formally, the cosmos is a locale of possibilities and a goal-state is an open subspace.)
An instantiation of a cosmos-state is a substate. E.g. [I have this particular piece of gold] is an instantiation of [I have a piece of gold in general].
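A rough formal sketch of the last two definitions (writing X for the space of all cosmos-states; the symbol X is introduced here only for illustration, not taken from the definitions above):
\[ \text{goal-state: } G \subseteq X \text{, an open subspace of } X \]
\[ \text{instantiation: } G_0 \text{ is an instantiation of } G \iff G_0 \subseteq G \]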
A goal is a property of an agent of the form: this agent is trying to bring about goal-state G. Overloading the notation, we also say the agent has the goal G, and we can imprecisely let the term also stand for whatever mental elements constitute the goal and its relations to the rest of the agent.
A pursuit of G is behavior, a set of actions, selected to achieve G. Since an agent having a goal G is basically the same as the agent pursuing G, we also say e.g. "the farmer has the goal [grow crops]", which really means, "the farmer selects actions that ze expects will bring about zer goal state [have crops], and so ze behaves in a way that's well-described as [growing crops]".
A strategy (for G) is a pursuit (of G) that involves an arrangement of goals (perhaps in sequence or in multiple parallel lines, perhaps with branching on contingencies).
If an agent is following a strategy for G, then for each goal Gs in the strategy, Gs is a subgoal of the agent's goal of G, and G is a supergoal of Gs. We can also loosely say that some goal G′ is a subgoal of G if G′ helps achieve G, whether or not there's an agent pursuing G using G′, meaning vaguely something like "there's a strategy sufficient to achieve G that uses G′ as a necessary part" or "many agents will do reasonably well in pursuing G if they use G′".
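As a minimal sketch, one way to make the loose subgoal relation precise (reading "necessary part" as "removing it breaks the strategy"; this is one possible formalization, not the only one):
\[ G' \text{ is a subgoal of } G \iff \exists\, S \text{ a strategy sufficient to achieve } G \text{ such that } G' \in S \text{ and } S \setminus \{G'\} \text{ is not sufficient to achieve } G \]
\[ G \text{ is a supergoal of } G' \iff G' \text{ is a subgoal of } G \]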
An intermediate goal is a goal that motivates the pursuit of subgoals, and itself is motivated by the pursuit of one or more supergoals.
The motivator set of a goal G is the set of goals Gs such that Gs is plausibly a supergoal of G.
The (sufficient) strategy set of a goal G is the set of strategies that are sufficient to achieve G.
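In set-builder notation, a sketch of these two sets (the names MotivatorSet and StrategySet are shorthands introduced here for illustration):
\[ \mathrm{MotivatorSet}(G) = \{\, G_s : G_s \text{ is plausibly a supergoal of } G \,\} \]
\[ \mathrm{StrategySet}(G) = \{\, S : S \text{ is a strategy sufficient to achieve } G \,\} \]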
Inferring supergoals through subgoals
Suppose that G is an in...