Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Empirical work that might shed light on scheming (Section 6 of "Scheming AIs"), published by Joe Carlsmith on December 11, 2023 on The AI Alignment Forum.
This is Section 6 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
Empirical work that might shed light on scheming
I want to close the report with a discussion of the sort of empirical work that might help shed light on scheming.[1] After all: ultimately, one of my key hopes from this report is that greater clarity about the theoretical arguments surrounding scheming will leave us better positioned to do empirical research on it - research that can hopefully clarify the likelihood that the issue arises in practice, catch it if/when it has arisen, and figure out how to prevent it from arising in the first place.
To be clear: per my choice to write the report at all, I also think there's worthwhile theoretical work to be done in this space. For example:
I think it would be great to formalize more precisely different understandings of the concept of an "episode," and to formally characterize the direct incentives that different training processes create towards different temporal horizons of concern (see the sketch after this list for one illustrative starting point).[2]
I think that questions around the possibility/likelihood of different sorts of AI coordination are worth much more analysis than they've received thus far, both in the context of scheming in particular, and for understanding AI risk more generally.
Here I'm especially interested in coordination between AIs with distinct value systems, in the context of human efforts to prevent the coordination in question, and for AIs that resemble near-term, somewhat-better-than-human neural nets rather than e.g. superintelligences with assumed-to-be-legible source code.
I think there may be interesting theoretical work to do in further characterizing/clarifying SGD's biases towards simplicity/speed, and in understanding the different sorts of "path dependence" to expect in ML training more generally.
I'd be interested to see more work clarifying ideas in the vicinity of "messy goal-directedness" and their relevance to arguments about schemers. I think a lot of people have the intuition that thinking of model goal-directedness as implemented by a "big kludge of heuristics" (as opposed to: something "cleaner" and more "rational agent-like") makes a difference here (and elsewhere). But I think people often aren't fully clear on the contrast they're trying to draw, and why it makes a difference, if it does.
More generally, any of the concepts/arguments in this report could be clarified and formalized further, other arguments could be formulated and examined, quantitative models for estimating the probability of scheming could be created, and so on.
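To give a flavor of the kind of formalization the first item above gestures at, here is a minimal sketch of my own (the notation and the particular definition are illustrative assumptions, not something the report commits to), assuming a standard episodic policy-gradient setup. Suppose a policy $\pi_\theta$ is trained on episodes of length $T$ with the objective
\[
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right], \qquad 0 \le \gamma \le 1.
\]
The policy-gradient update depends only on rewards received before the episode boundary:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T-1} \gamma^t r_t\right) \nabla_\theta \log \pi_\theta(\tau)\right],
\]
so one candidate formalization of "the episode," in the incentive-relevant sense, is the smallest horizon $T^\ast$ such that rewards at times $t \ge T^\ast$ never enter $\nabla_\theta J$. The "direct incentives towards different temporal horizons of concern" could then be characterized by how $T^\ast$ and the discount $\gamma$ vary across training setups - though real training processes (for example, feedback signals that themselves depend on later outcomes) will complicate this picture.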
Ultimately, though, I think the empirics are what will shed the most informative and consensus-ready light on this issue. So one of my favorite outcomes from someone reading this report would be the reader saying something like: "ah, I now understand the arguments for and against expecting scheming much better, and have had a bunch of ideas for how we can probe the issue empirically" - and then making it happen. Here I'll offer a few high-level suggestions in this vein, in the hope of prompting future, higher-quality work (designing informative empirical ML experiments is not my area of expertise - and indeed, I'm comparatively ignorant of various parts of the literature relevant to the topics below)....