Conditioning Predictive Models: The case for competitiveness, by Evan Hubinger, published on February 6, 2023 on the AI Alignment Forum.
This is the third of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out.
3. The case for competitiveness
In addition to ensuring that we can condition predictive models safely, for such an approach to actually reduce AI existential risk we also need it to be competitive: that is, it shouldn't impose too much of an alignment tax. Following "How do we become confident in the safety of a machine learning system?", we'll distinguish between two aspects of competitiveness that we'll need to address:
Training rationale competitiveness (implementation competitiveness): how hard the training rationale (how we get the model we want) is to execute. That is, a proposal should fail on training rationale competitiveness if its training rationale is significantly more difficult to implement (e.g. because of compute or data requirements) than competing alternatives.
Training goal competitiveness (performance competitiveness): whether, if successfully achieved, the training goal (the model we want) would be powerful enough to compete with other AI systems. That is, a proposal should fail on training goal competitiveness if it would be easily outcompeted by other AI systems that might exist in the world.
To make these concepts easier to keep track of absent the full training stories ontology, we’ll call training rationale competitiveness implementation competitiveness, since it describes the difficulty of implementing the proposal, and training goal competitiveness performance competitiveness, since it describes the achievable performance for the resulting model.
Implementation competitiveness
The most generally capable models today, large language models, seem to be well-described as predictive models. That may change, but we think it is at least quite plausible that the first human-level AGI will be some sort of predictive model, likely similar in structure to current LLMs.
Furthermore, LLM pre-training in particular seems to be where most of the capabilities of the most advanced current models come from: the vast majority of compute spent training large language models is spent in pre-training, not fine-tuning. Additionally, our guess is that the fine-tuning that is done is best modeled as targeting existing capabilities rather than introducing entirely new capabilities.
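To give a feel for the size of that gap, here is a rough back-of-the-envelope sketch using the common ~6ND rule of thumb for training FLOPs (N parameters, D tokens). The model size and token counts below are hypothetical assumptions, chosen only to illustrate the order of magnitude:

```python
# Rough FLOPs comparison between pre-training and fine-tuning,
# using the common ~6 * N * D approximation for transformer training compute.
# All numbers here are illustrative assumptions, not measurements.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

N = 70e9                 # hypothetical 70B-parameter model
pretrain_tokens = 1e12   # hypothetical 1T-token pre-training corpus
finetune_tokens = 1e8    # hypothetical 100M-token fine-tuning dataset

pretrain = training_flops(N, pretrain_tokens)
finetune = training_flops(N, finetune_tokens)

print(f"pre-training: {pretrain:.1e} FLOPs")  # ~4.2e+23
print(f"fine-tuning:  {finetune:.1e} FLOPs")  # ~4.2e+19
print(f"ratio: {pretrain / finetune:,.0f}x")  # ~10,000x
```

Under these made-up but representative numbers, fine-tuning accounts for roughly 0.01% of total training compute, which is why we treat pre-training as the source of the bulk of current capabilities.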
Assuming that LLMs are well-understood as predictive models after pre-training, there are two possibilities for how to think about different fine-tuning regimes:
The fine-tuning resulted in a particular conditional of the original pre-trained predictive model (see the sketch after this list).
The fine-tuning targeted the capabilities by turning the predictive model into one that is no longer well-understood as predictive.
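To make the first possibility concrete, here is a minimal sketch of what taking a particular conditional of a pre-trained predictive model looks like in practice: prepend an observation to the context and sample continuations, so that the model predicts what text would follow that observation. It uses the Hugging Face transformers API with GPT-2 purely as a stand-in for a large predictive model; the prompt is an illustrative assumption:

```python
# Minimal sketch: a "conditional" of a pre-trained predictive model is just
# sampling continuations given an observation placed in the context.
# GPT-2 is a stand-in for any pre-trained LLM; the prompt is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The observation we condition on: the model predicts what text
# would plausibly follow this observation in its training distribution.
observation = "The following is a peer-reviewed alignment research paper:\n\n"

inputs = tokenizer(observation, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,       # sample from p(continuation | observation)
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

On this view, fine-tuning for a particular conditional (the first possibility above) can be thought of as baking a prompt like this into the weights, rather than adding fundamentally new machinery.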
In the first case, the conditioning predictive models approach would simply be a variation on the exact techniques currently used at the forefront of capabilities, hopefully making it implementation competitive by default.[1] The main way we think such an implementation competitiveness argument could fail is if the fine-tuning necessary to get the sort of conditionals we describe here is substantially harder than alternative fine-tuning paradigms.
In particular, we think it is likely the case that our proposed solutions will add some amount of o...