The Nonlinear Library: Alignment Forum

AF - The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs") by Joe Carlsmith


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs"), published by Joe Carlsmith on December 6, 2023 on The AI Alignment Forum.
This is Sections 4.1 and 4.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Arguments for/against scheming that focus on the final properties of the model
Various arguments for/against scheming proceed by comparing the final properties of different model classes (e.g., schemers, training saints, reward-on-the-episode seekers) and asking how well each performs according to some set of criteria that we imagine SGD is selecting for.
What is SGD selecting for? Well, one obvious answer is: high reward. But various of the arguments I'll consider won't necessarily focus on reward directly. Rather, they'll focus on other criteria, like the "simplicity" or the "speed" of the resulting model. However, we can distinguish between two ways these criteria can enter into our predictions about what sort of model SGD will select.
Contributors to reward vs. extra criteria
On the first frame, which I'll call the "contributors to reward" frame, we understand criteria like "simplicity" and "speed" as relevant to the model SGD selects only insofar as they are relevant to the amount of reward that a given model gets. That is, on this frame, we're really only thinking of SGD as selecting for one thing - namely, high reward performance - and simplicity and speed are relevant insofar as they're predictive of high reward performance.
Thus, an example of a "simplicity argument," given in this frame, would be: "a schemer can have a simpler goal than a training saint, which means that it would be able to store its goal using fewer parameters, thereby freeing up other parameters that it can use for getting higher reward."
This frame has the advantage of focusing, ultimately, on something that we know SGD is indeed selecting for - namely, high reward. And it puts the relevance of simplicity and speed into a common currency - namely, contributions-to-reward.
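As a rough illustration of this frame (a minimal sketch of my own, with hypothetical model classes and made-up numbers - not anything from the report): selection operates on reward alone, and a property like goal-simplicity matters only through its effect on that single score.

```python
# Toy illustration of the "contributors to reward" frame (hypothetical
# numbers): SGD is modeled as selecting purely on reward, and a property
# like goal-simplicity matters only via its effect on reward.

def reward(model):
    # Simplicity frees up parameters that can be spent on task performance,
    # so here it feeds into reward rather than being scored separately.
    base = model["task_competence"]
    capacity_bonus = 0.01 * model["params_freed_by_simple_goal"]
    return base + capacity_bonus

candidates = [
    {"name": "training saint", "task_competence": 0.90, "params_freed_by_simple_goal": 0},
    {"name": "schemer",        "task_competence": 0.90, "params_freed_by_simple_goal": 5},
]

# Selection uses reward alone - the single currency of this frame.
selected = max(candidates, key=reward)
print(selected["name"])  # -> "schemer", because simplicity raised its reward
```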
By contrast: on the second frame, which I'll call the "extra criteria" frame, we understand these criteria as genuinely additional selection pressures, operative even independent of their impact on reward. That is, on this frame, SGD is selecting both for high reward, and for some other properties - for example, simplicity.
Thus, an example of a "simplicity argument," given in this frame, would be: "a schemer and a training saint would both get high reward in training, but a schemer can have a simpler goal, and SGD is selecting for simplicity in addition to reward, so we should expect it to select a schemer."
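In code, this frame resembles an explicitly regularized objective: the selection score is reward plus a separately weighted simplicity term, operating even when rewards are tied. Again, this is a minimal sketch of my own, with invented numbers and an assumed weight, not a claim about how SGD actually works:

```python
# Toy illustration of the "extra criteria" frame (hypothetical numbers):
# simplicity is scored as its own term, analogous to a regularizer,
# rather than only through its effect on reward.

SIMPLICITY_WEIGHT = 0.2  # strength of the extra selection pressure (assumed)

def selection_score(model):
    return model["reward"] + SIMPLICITY_WEIGHT * model["goal_simplicity"]

candidates = [
    {"name": "training saint", "reward": 0.95, "goal_simplicity": 0.3},
    {"name": "schemer",        "reward": 0.95, "goal_simplicity": 0.8},
]

# Both get the same reward, but the schemer wins on the extra criterion.
selected = max(candidates, key=selection_score)
print(selected["name"])  # -> "schemer"
```

The SIMPLICITY_WEIGHT here just stands in for however strongly the extra pressure operates; the frame itself is agnostic about its magnitude.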
The "extra criteria" frame is closely connected to the discourse about "inductive biases" in machine learning - where an inductive bias, roughly, is whatever makes a learning process prioritize one solution over another other than the observed data (see e.g. Box 2 in Battaglia et al (2018) for more).
Thus, for example, if two models would perform equally well on the training data, but differ in how they would generalize to an unseen test set, the inductive biases would determine which model gets selected. Indeed, in some cases, a model that performs worse on the training data might get chosen because it was sufficiently favored by the inductive biases (as analogy: in science, sometimes a simpler theory is preferred despite the fact that it provides a worse fit with the data).
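To make the tie-breaking (and tie-overriding) role of inductive biases concrete, here is another toy sketch of my own with invented numbers: when training fit is equal, the bias term decides, and a sufficiently strong bias can even outweigh a small deficit in fit.

```python
# Toy illustration of inductive bias as a tie-breaker (hypothetical numbers).
# "train_fit" is performance on the training data; "bias_favorability"
# stands in for how much the learning process's inductive biases favor
# a given solution.

BIAS_STRENGTH = 0.5  # assumed weight of the inductive bias

def preference(solution):
    return solution["train_fit"] + BIAS_STRENGTH * solution["bias_favorability"]

# Case 1: equal training fit - the inductive bias decides.
a = {"name": "generalizes one way",     "train_fit": 0.90, "bias_favorability": 0.2}
b = {"name": "generalizes another way", "train_fit": 0.90, "bias_favorability": 0.7}
print(max([a, b], key=preference)["name"])  # -> "generalizes another way"

# Case 2: slightly worse fit, but strongly favored by the bias
# (like preferring a simpler scientific theory despite a worse fit).
c = {"name": "better fit",         "train_fit": 0.92, "bias_favorability": 0.1}
d = {"name": "simpler, worse fit", "train_fit": 0.90, "bias_favorability": 0.6}
print(max([c, d], key=preference)["name"])  # -> "simpler, worse fit"
```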
Ultimately, the differences...