The Nonlinear Library: Alignment Forum

AF - Simplicity arguments for scheming (Section 4.3 of "Scheming AIs") by Joe Carlsmith



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simplicity arguments for scheming (Section 4.3 of "Scheming AIs"), published by Joe Carlsmith on December 7, 2023 on The AI Alignment Forum.
This is Section 4.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Simplicity arguments
The strict counting argument I've described is sometimes presented in the context of arguments for expecting schemers that focus on "simplicity."[1] Let's turn to those arguments now.
What is "simplicity"?
What do I mean by "simplicity," here? In my opinion, discussions of this topic are often problematically vague - both with respect to the notion of simplicity at stake, and with respect to the sense in which SGD is understood as selecting for simplicity.
The notion that Hubinger uses, though, is the length of the code required to write down the algorithm that a model's weights implement. That is: faced with a big, messy neural net that is doing X (for example, performing some kind of induction), we imagine re-writing X in a programming language like Python, and we ask how long the relevant program would have to be.[2] Let's call this "re-writing simplicity."[3]
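To illustrate what "re-writing simplicity" amounts to, here's a minimal sketch under a toy assumption: suppose the behavior the network has learned is a crude induction-style rule over token sequences. The rule, the function name, and the use of source-code length as the measure are all hypothetical choices for illustration, not anything specified in the report.

```python
# Toy sketch of "re-writing simplicity": re-express a learned behavior as a
# short Python program and use the program's length as the simplicity measure.
# The specific rule and the character-count proxy are illustrative assumptions.

import inspect

def induction_rule(tokens):
    """Predict the token that followed the most recent earlier occurrence
    of the current token; otherwise just repeat the current token."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return current

# Crude proxy for the length of the re-written program: characters of source.
# (A careful version would fix a reference language and count bits, not chars.)
program_length = len(inspect.getsource(induction_rule))

print(induction_rule(["a", "b", "a"]))   # -> "b"
print(program_length, "characters of Python to express this behavior")
```

The point is just that the complexity being measured attaches to the re-written program, not to the network itself - the extra layer of theoretical machinery that "parameter simplicity," below, tries to remove.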
Hubinger's notion of simplicity, here, is closely related to measures of algorithmic complexity like "Kolmogorov complexity," which measure the complexity of a string by reference to the length of the shortest program that outputs that string when fed into a chosen Universal Turing Machine (UTM).
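For reference, the standard definition being gestured at here can be written out explicitly (this is the textbook formulation, not a formula from the report): for a fixed UTM U and string s,

```latex
K_U(s) \;=\; \min \{\, |p| \;:\; U(p) = s \,\}
```

where |p| is the length of program p. By the invariance theorem, changing the UTM changes K_U(s) only by an additive constant, which is one reason the dependence on the choice of UTM is sometimes treated as benign.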
Indeed, my vague sense is that certain discussions of simplicity in the context of computer science often implicitly assume what I've called "simplicity realism" - a view on which simplicity is, in some deep sense, an objective thing, ultimately independent of e.g. your choice of programming language or UTM, but which different metrics of simplicity are all tracking (albeit imperfectly).
And perhaps this view has merit (for example, my impression is that different metrics of complexity often reach similar conclusions in many cases - though this could have many explanations). However, I don't, personally, want to assume it. And especially absent some objective sense of simplicity, it becomes more important to say which particular sense you have in mind.
Another possible notion of simplicity, here, is hazier - but also, to my mind, less theoretically laden.
On this notion, the simplicity of an algorithm implemented by a neural network is defined relative to something like the number of parameters the neural network uses to encode the relevant algorithm.[6] That is, instead of imagining re-writing the neural network's algorithm in some other programming language, we focus directly on the parameters the neural network itself is recruiting to do the job, where simpler programs use fewer parameters.
Let's call this "parameter simplicity." Exactly how you would measure "parameter simplicity" is a different question, but it has the advantage of removing one layer of theoretical machinery and arbitrariness (e.g., the step of re-writing the algorithm in an arbitrary-seeming programming language), and connecting more directly with a "resource" that we know SGD has to deal with (e.g., the parameters the model makes available). For this reason, I'll often focus on "parameter simplicity" below.
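As a rough illustration of what a "parameter simplicity" measurement might look like, here's a minimal sketch. The thresholding proxy, the weight shapes, and the function name are assumptions made up for the example; the report doesn't commit to any particular way of counting the parameters a model recruits.

```python
# Toy sketch of a "parameter simplicity" proxy: of all the parameters a model
# has, count how many are doing non-negligible work (here: magnitude above a
# pruning threshold). The threshold and shapes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a trained model's weight matrices.
weights = [rng.normal(size=(64, 64)), rng.normal(size=(64, 10))]

def recruited_parameters(weights, threshold=0.5):
    """Return (parameters above threshold, total parameters)."""
    used = sum(int(np.sum(np.abs(w) > threshold)) for w in weights)
    total = sum(w.size for w in weights)
    return used, total

used, total = recruited_parameters(weights)
print(f"{used} of {total} parameters above the pruning threshold")
```

On this way of counting, a model that implements the same behavior with fewer above-threshold parameters counts as "simpler," which connects the measure directly to the parameter budget SGD actually has to work with.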
I'll also flag a way of talking about "simplicity" that I won't emphasize, and which I think muddies the waters here considerably: namely, equating simplicity fairly directly with "higher prior probability." Thus, for example, faced w...