Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs"), published by Joe Carlsmith on December 5, 2023 on The AI Alignment Forum.
This is Section 3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Arguments for/against scheming that focus on the path that SGD takes
In this section, I'll discuss arguments for/against scheming that focus more directly on the path that SGD takes in selecting the final output of training.
Importantly, it's possible that these arguments aren't relevant. In particular: if SGD actively favors or disfavors schemers in some kind of "direct comparison" between model classes, and SGD will "find a way" to select the sort of model it favors in this sense (for example, because sufficiently high-dimensional spaces make such a "way" available),[1] then enough training will just lead you to whatever model SGD most favors, and the "path" in question won't really matter.
In the section on comparisons between the final properties of the different models, I'll discuss some reasons we might expect this sort of favoritism from SGD. In particular: schemers are "simpler" because they can have simpler goals, but they're "slower" because they need to engage in various forms of extra instrumental reasoning - e.g., in deciding to scheme, checking whether now is a good time to defect, potentially engaging in and covering up efforts at "early undermining," etc (though note that the need to perform extra instrumental reasoning, here, can manifest as additional complexity in the algorithm implemented by a schemer's weights, and hence as a "simplicity cost", rather than as a need to "run that algorithm for a longer time").[2] I'll say much more about this below.
Here, though, I want to note that if SGD cares enough about properties like simplicity and speed, it could be that SGD will typically build a model with long-term power-seeking goals first, but then, even if this model tries a schemer-like strategy (which it wouldn't necessarily do in this scenario, given foreknowledge that the strategy would fail), it will get relentlessly ground down into a reward-on-the-episode seeker due to the reward-on-the-episode seeker's speed advantage. Or it could be that SGD will typically build a reward-on-the-episode seeker first, but that model will be relentlessly ground down into a schemer due to SGD's hunger for simpler goals.
In this section, I'll be assuming that this sort of thing doesn't happen. That is, the order in which SGD builds models can exert a lasting influence on where training ends up. Indeed, my general sense is that discussion of schemers often implicitly assumes something like this - e.g., the thought is generally that a schemer will arise sufficiently early in training, and then lock itself in after that.
The training-game-independent proxy-goals story
Recall the distinction I introduced above, between:
Training-game-independent beyond-episode goals, which arise independently of their role in training-gaming, but then come to motivate training-gaming, vs.
Training-game-dependent beyond-episode goals, which SGD actively creates in order to motivate training gaming.
Stories about scheming focused on training-game-independent goals seem to me more traditional. That is, the idea is:
Because of [insert reason], the model will develop a (suitably ambitious) beyond-episode goal correlated with good performance in training (in a manner that doesn't route v...