Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Normative vs Descriptive Models of Agency, published by Matt MacDermott on February 2, 2023 on The AI Alignment Forum.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
I think there's an important distinction to be made between work in agent foundations that is concerned with normative models and work that is concerned with descriptive models. They are increasingly separate bodies of work, with different aims and different theories of change when it comes to alignment.
Examples
The normative branch is typified by the Embedded Agency sequence, and the whole branch can be summed up as 'The Hunt for Embedded AIXI'. Its goal is to figure out how to build an ideal agent in principle. Decision theory, infra-Bayesianism, and logical induction all come under the normative banner.
The descriptive branch is typified by John Wentworth's Basic Foundations for Agent Models sequence. Descriptive work aims to understand the agents we run into in the wild. Other examples include shard theory, Critch's Boundaries sequence, and the Discovering Agents paper.
Theories of Change
Descriptive
I'll start with the descriptive branch. The most ambitious version of its goal is to understand agency so well that in principle we could take an unabstracted, non-agentic description of a system - e.g. a physics-level causal graph, the weights in a neural network, or a cellular model of a squirrel - and identify its goals, world-model, and so on, if it has any. If we could do that in principle, then in practice we could probably check whether an artificial agent is aligned, and maybe we could even do things like surgically modify its goals, or directly point to things we care about in its world-model. I think that's what John is aiming for. A less ambitious goal, which I think better describes the aims of shard theory, is to understand agency well enough that we can carefully guide the formation of agents' goals during ML training runs.
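To make the shape of that ambitious goal concrete, here is a minimal, purely hypothetical sketch of the interface such a descriptive theory would provide. Every name in it is invented for illustration; nothing below corresponds to an existing method or library.

```python
# Hypothetical illustration only: these names are invented for this sketch and
# do not come from the post or from any existing agent-foundations tooling.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CandidateAgentStructure:
    """What a mature descriptive theory of agency would let us read off a system."""
    is_agent: bool                     # the system might turn out not to be an agent at all
    goals: Optional[Any] = None        # e.g. a utility function, or shard-like values
    world_model: Optional[Any] = None  # e.g. a predictive model of the environment


def identify_agent_structure(low_level_description: Any) -> CandidateAgentStructure:
    """Take an unabstracted, non-agentic description of a system (a physics-level
    causal graph, the weights in a neural network, a cellular model of a squirrel)
    and return its agentic structure, if it has any.

    No such procedure currently exists; the signature just states the problem.
    """
    raise NotImplementedError("This is the open problem, not a solved one.")
```

The point of the sketch is only that the input is low-level and non-agentic while the output is the agentic structure we care about; the downstream uses mentioned above (checking alignment, surgically modifying goals, pointing into the world-model) would all consume something like the returned structure.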
Beyond that, I think everyone involved expects that descriptive work could lead to foundational insights that change our minds about which alignment strategies are most promising. In particular, these insights might answer questions like: whether intelligent entities are inevitably agents, whether agents are inevitably consequentialists, whether corrigibility is a thing, and whether we should expect to encounter sharp left turns.
Normative
The normative branch shares the conceptual clarification theory of change. I think there's a reasonable argument to be made that we should expect the theoretical ideal of agency to be much easier to understand than agency-in-practice, and that understanding it might provide most of the insight. But the normative branch also has a much more ambitious theory of change, which is something like: if we understand the theoretical ideal of agency well enough, we might be able to build an aligned AGI manually 'out of toothpicks and rubber bands'. I think this hope has fallen by the wayside in recent years, as the capabilities of prosaic AI have rapidly progressed. Doing it the hard way just seems like it will take too long.
Subproblems
The Embedded Agency sequence identifies four rough subquests in The Hunt for Embedded AIXI. Most work in the normative branch can be thought of as attacking one or another of these problems. Many of the insights of that sequence are directly applicable to the descriptive case, but the names of the subproblems are steeped in normative language. Moreover, there are aspects of the descriptive challenge which don't seem to have normative analogues. It therefore seems worth trying to identify a separate set of descriptive subproblems, and vaguely categorise descriptive work according to which of them it gets at. I'll suggest some subproblems here, with a view to using them ...