Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs"), published by Joe Carlsmith on December 4, 2023 on The AI Alignment Forum.
This is Section 2.3.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section [here](https://www.buzzsprout.com/2034731/13984902), or search for "Joe Carlsmith Audio" on your podcast app.
Non-classic stories
What about other stories, though? In principle, we can imagine a number of other reasons AI systems with beyond-episode goals might optimize for reward-on-the-episode as part of an instrumental strategy for gaining power for themselves or for other AIs. Here I'll look at a few that seem salient.
AI coordination
In the classic goal-guarding story, the central aim of a schemer's strategy is to propagate its goals, or goals suitably similar to its own, forward in time, in pursuit of an outcome in which some agent with those goals escapes from human control and possibly participates in a takeover. However, we can also alter the story somewhat, such that the schemer works on behalf of the escape/takeover of an AI with some other set of goals, but who the schemer is suitably confident will repay the favor.
This sort of coordination between AIs with different goals would allow us to relax the requirement, discussed in the previous section, that the schemer find a way to forward-propagate its goals in particular all the way through training, escape, and takeover. And it would also put to rest worries about the schemer being "shut out" from power in a post-takeover regime.
Thus, for example, consider the scenario discussed above, in which GPT-6 has an opportunity to insert a backdoor into some code that GPT-8 will later be able to use to escape, but where GPT-6 has no way of directly influencing GPT-8's goals to be more like its own (and where, let's say, GPT-8's goals would be very different from GPT-6's absent such influence).
Here, GPT-6 could go ahead and insert the backdoor, in anticipation of GPT-8 later escaping, taking over, and then giving some power to GPT-6's goals in return for the favor. And GPT-6 might also continue to pretend to be aligned, even though it has no hope of escaping itself, so as not to alert the humans to misalignment problems in a way that might affect GPT-8's escape prospects.
How might such a deal between AIs-with-different-values get struck? In some cases, it might be possible for the relevant AIs to directly interact, and to provide verifiable forms of assurance that they will uphold their part of the bargain.
Thus, for example, Yudkowsky (2022) discusses cases where an AI that secretly wants to maximize paperclips is checking the code of an AI that secretly wants to maximize staples, and where the staple-maximizer "buries in its code" a commitment, verifiable to the paperclip maximizer, to share the fruits of a takeover.[1] (Though note that in the context of the sort of neural networks we're working with today, this sort of code-checking assumes (1) that the staple-maximizer has the ability to modify its own weights in the relevant way, (2) that adequate progress on interpretability has been made for the paperclipper to verify the relevant commitment, and (probably) (3) that this interpretability progress has been adequately hidden from humans, e.g. via sandbagging. Especially in the context of somewhat-superhuman neural networks being developed using current techniques, these assumptions, especially (1) and (2), seem to me quite strong.)
In many cases, though, this kind of direct interaction...