Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Build a Causal Decision Theorist, published by michaelcohen on March 9, 2023 on The AI Alignment Forum.
I'll argue here that we should make an aligned AI which is a causal decision theorist.
Son-of-CDT
Suppose we are writing code for an agent with an action space A and an observation space O. The code determines how actions will be selected given the prior history of actions and observations. If the only way that our choice of what code to write can affect the world is through the actions that will be selected by the agent running this code, then the best we can do (for a given utility function that we know how to write down) is to make this agent a causal decision theorist. If our choice of what code to use can affect the world in other ways, all bets are off. The best choice of what code to put in the agent depends on details of the world we find ourselves in.
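To make the setup concrete, here is a minimal Python sketch of action selection for such an agent. The names `outcome_dist` and `utility` are hypothetical placeholders of my own: a causal world model giving P(outcome | do(action), history), and the utility function we know how to write down.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Action = Hashable
Observation = Hashable
History = Tuple  # the prior sequence of actions and observations

def cdt_action(
    history: History,
    actions: List[Action],
    outcome_dist: Callable[[History, Action], Dict[Hashable, float]],
    utility: Callable[[Hashable], float],
) -> Action:
    """Select the action whose causal consequences have the highest expected utility.

    outcome_dist(history, a) plays the role of P(outcome | do(a), history):
    it only tracks effects that flow through the chosen action, which is
    exactly the assumption under which CDT is the best we can do.
    """
    def expected_utility(a: Action) -> float:
        return sum(p * utility(o) for o, p in outcome_dist(history, a).items())
    return max(actions, key=expected_utility)
```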
Therefore, if we run a CDT agent, it may well conclude that continuing to operate is not the best way to convert energy into expected utility. It may take actions to cause the following to happen: a) the program which computes its own actions is terminated, and b) some new program is run on the same computer to output actions given the interaction history. The new program that gets run (if indeed such a thing happens) is called Son-of-CDT. Given the state of the world, which entails various ways in which the source code of an agent might affect the outside world besides through the actions that the code outputs, Son-of-CDT is the best program to run for maximizing expected utility. The original CDT agent chooses the program that meets this specification. In general, this will not have anything remotely like a nice, simple closed form. If there are agents out there with vendettas against certain agent-programs, it will take that into account.
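Son-of-CDT has a simple specification even though it has no simple closed form: it is an argmax over programs rather than over actions, where the score must price in every channel by which running a program affects the world, not just the actions it outputs. A hypothetical sketch of that specification (the names are mine, not anything from MIRI's writeups):

```python
from typing import Callable, Iterable, TypeVar

Program = TypeVar("Program")

def son_of_cdt(
    candidate_programs: Iterable[Program],
    expected_utility_if_run: Callable[[Program], float],
) -> Program:
    """The program whose being run maximizes expected utility in this world.

    expected_utility_if_run must account for *all* effects of running the
    program -- including how other agents react to its source code -- not
    just the effects of the actions it outputs. Computing it is where the
    difficulty lives; the argmax itself is the easy part.
    """
    return max(candidate_programs, key=expected_utility_if_run)
```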
Vendettas against Son-of-CDT?
CDT agents can be bullied. I believe the MIRI view is that Son-of-CDT will be bullied as well. Suppose there is an ultimatum game: agent A offers some amount up to $10 to agent B; if agent B accepts, agent B receives the offered amount and agent A keeps $10 minus the offer; otherwise, both get nothing. A competent agent in the position of agent B, able to make a credible commitment (perhaps by revealing its source code), would commit to accept nothing less than $9.99 if agent A is a CDT agent. This would work out for the competent agent, because the CDT agent would see all this and realize it could be one penny richer if it offers $9.99.
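As a toy illustration of the arithmetic here (my own numbers, in cents to keep them exact), once agent B's commitment is treated as a fixed fact about the world, the CDT agent's comparison looks like this:

```python
TOTAL_CENTS = 1000      # the $10 pie
THRESHOLD_CENTS = 999   # B's credible commitment: reject anything below $9.99

def payoff_to_a(offer_cents: int) -> int:
    """Agent A's payoff, given that B's commitment is already locked in."""
    accepted = offer_cents >= THRESHOLD_CENTS
    return TOTAL_CENTS - offer_cents if accepted else 0

offers = [1, 500, 998, 999]
print({o: payoff_to_a(o) for o in offers})  # {1: 0, 500: 0, 998: 0, 999: 1}
print(max(offers, key=payoff_to_a))         # 999: offer $9.99 and keep one penny
```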
Eliezer claims that a "[competent] agent [chooses] to reject offers short of $9.99 from [the CDT agent's] offspring." (Original: "the LDT agent's choice to reject offers short of $9.99 from its offspring".)
In my sketch above of the creation of Son-of-CDT, I included the detail that it would be housed in the same computer that ran the original agent, but this needn't be the case. It could be run anywhere in the world. The CDT agent could take any sort of actions that would cause Son-of-CDT to come into existence some time in the future, somewhere in the world. There is no clear way to distinguish the "offspring" of an agent, given that an agent's actions can cause other agents to come into existence in arbitrary ways. For a competent agent to reject offers short of $9.99 from the "offspring" of a CDT agent, it would have to reject offers short of $9.99 from all agents that came into being after the existence of a single CDT agent. It would have to bully everyone.
After a CDT agent with a certain utility function comes into being, if there exists an accessible future in which a competent agent optimizes that utility function (where "accessible" is with respect to the action space of the CDT agent), then the CDT agent will access that future by taking the appropriate actions, and that competent agent will come into being. If it is true t...