The case for removing alignment and ML research from the training dataset, published by beren on May 30, 2023 on LessWrong.
This is crossposted from my personal blog.
Many possible sources of danger from AI stem from the AI knowing a lot about itself, us, and the potential auditing and alignment methods we might try to use to either align the AI or to detect misaligned models. Suppose we have an AGI that contains a misaligned mesaoptimizer playing the training game. To succeed at this, the mesaoptimizer needs to 'know' essentially all of the following facts:
1.) That it is part of an AI model being trained and what that means,
2.) How training by SGD works,
3.) Probably some indication of the architecture of the model it exists in,
4.) The difference between deployment and training environments, as well as the fact that 'deployment' is likely to occur at some point in its future and will be much less constrained than training,
5.) The kinds of things that human overseers encourage and are looking out for.
Similarly with deception. To successfully deceive us, even when we are running evals to detect it, a misaligned power-seeking AI needs almost perfect knowledge of the following:
1.) That it is an AI model with goals misaligned from that of its creators,
2.) That if we detect its misaligned 'true' behaviour, we will probably turn it off or otherwise thwart its goals,
3.) That the situations it observes itself in are almost certainly training or evaluation situations where it should behave nicely,
4.) What the 'nice' and aligned behaviour humans expect actually is,
5.) How to tell when it is in 'real deployment' mode so it can start amassing power,
6.) That, if we are using interpretability or other probing-based methods to detect deception and monitor its internal thoughts, these methods exist and approximately how they work, so that it has a good chance of thwarting us by applying techniques for circumventing interpretability.
Without a large degree of this kind of knowledge, pulling off successful deception or a sharp left turn maneuver against a moderately competent set of human and AI overseers seems very challenging and unlikely to succeed on the first shot (and an individual misaligned AI has only one shot).
Almost all of this information can be derived from a sensory stream and knowledge base that comprises most of the internet, and the current training data certainly contains much (but not all) of it. However, this information is almost certainly not derivable from scratch. This means that an incredibly low-hanging fruit for safety is simply to remove the documents containing this information from the training set the AI is exposed to! The less the AGI knows specifically about itself, about us, and about our alignment strategies and countermeasures, the less likely it is to successfully evade and counteract them.
Naturally, there is a tradeoff between the capabilities of the model and the information we allow it access to. We want AI systems to be good and useful to people, and ultimately we do want to build an AGI which is aligned. An AI that knows literally nothing is very safe, but also useless. However, it seems like the Pareto frontier near our current position is extremely flat, such that we can maybe get large safety gains at almost no general capability cost. Specifically, my proposal is to remove from the training set documents which could provide knowledge about the model's own internal workings (to help prevent self-awareness) and about our alignment techniques and countermeasures for detecting and neutralizing misaligned AGIs.
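To make this concrete, here is a minimal sketch of the kind of per-document filter such a proposal implies, assuming each pretraining document carries a URL and its raw text. The domain blocklist, keyword list, and threshold are purely illustrative placeholders, not a vetted recipe.

```python
# Minimal sketch of a pretraining-data filter, assuming each document is a
# dict with "url" and "text" fields. Blocklists and thresholds are illustrative.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {
    "lesswrong.com",
    "alignmentforum.org",
    "arxiv.org",  # or restrict to ML categories such as cs.LG, cs.AI, stat.ML
}

ALIGNMENT_KEYWORDS = [
    "mesa-optimizer", "deceptive alignment", "rlhf",
    "interpretability", "gradient descent", "reward model",
]

def should_remove(doc: dict, keyword_threshold: int = 3) -> bool:
    """Return True if a document should be dropped from the pretraining set."""
    domain = urlparse(doc.get("url", "")).netloc.lower()
    if any(domain == d or domain.endswith("." + d) for d in BLOCKED_DOMAINS):
        return True
    # Crude keyword heuristic: drop documents that discuss alignment/ML
    # concepts heavily even if they come from an unblocked domain.
    text = doc.get("text", "").lower()
    hits = sum(text.count(kw) for kw in ALIGNMENT_KEYWORDS)
    return hits >= keyword_threshold

def filter_corpus(docs):
    """Yield only the documents that pass the filter."""
    for doc in docs:
        if not should_remove(doc):
            yield doc
```

In practice one would probably replace the keyword heuristic with a trained classifier, but the overall structure, a per-document predicate applied before tokenization, would stay the same.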
A naive solution would simply be to remove a.) LessWrong, the Alignment Forum, and all other major hubs of alignment discussion and b.) the ML arXiv from pretraining datasets. This is t...