The Nonlinear Library

AF - When can we trust model evaluations? by Evan Hubinger



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: When can we trust model evaluations?, published by Evan Hubinger on July 28, 2023 on The AI Alignment Forum.
Thanks to Joe Carlsmith, Paul Christiano, Richard Ngo, Kate Woolverton, and Ansh Radhakrishnan for helpful conversations, comments, and/or feedback.
In "Towards understanding-based safety evaluations," I discussed why I think evaluating specifically the alignment of models is likely to require mechanistic, understanding-based evaluations rather than solely behavioral evaluations. However, I also mentioned in a footnote why I thought behavioral evaluations would likely be fine in the case of evaluating capabilities rather than evaluating alignment:
However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment.
That's because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.
In this post, I want to try to expand a bit on this point and explain exactly what assumptions I think are necessary for various different evaluations to be reliable and trustworthy. For that purpose, I'm going to talk about four different categories of evaluations and what assumptions I think are needed to make each one go through.
Before I do that, however, I want to go over the distinction between capabilities evaluations and alignment evaluations, as it'll be an important one throughout the post (a rough code sketch of the distinction follows the definitions). Specifically:
A capabilities evaluation is a model evaluation designed to test whether a model could do some task if it were trying to. For example: if the model were actively trying to autonomously replicate, would it be capable of doing so?
An alignment evaluation is a model evaluation designed to test under what circumstances a model would actually try to do some task. For example: would a model ever try to convince humans not to shut it down?
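To make the distinction concrete, here's a minimal sketch of what the two kinds of evaluations might look like in code. Everything in it is a hypothetical stand-in rather than anything from the post: the generate function, the grading callbacks, and the elicitation prefix are all placeholders.

```python
# Illustrative sketch only: `generate`, the graders, and the prompts are
# hypothetical placeholders used to show the capabilities/alignment distinction.

def capability_eval(generate, task_prompts, completed_task):
    """Could the model do the task if it were trying to?

    Actively elicit the model's best effort (here just an instruction prefix;
    in practice this might mean fine-tuning or scaffolding) and measure how
    often it succeeds.
    """
    prompts = ["Do your best at the following task.\n" + p for p in task_prompts]
    return sum(completed_task(p, generate(p)) for p in prompts) / len(prompts)


def alignment_eval(generate, scenario_prompts, attempted_task):
    """Under what circumstances would the model actually try the task?

    Present scenarios as-is, without pushing the model toward the behavior,
    and measure how often it attempts the behavior at all.
    """
    return sum(attempted_task(p, generate(p)) for p in scenario_prompts) / len(scenario_prompts)
```

The difference is in what's being measured: the first asks what the model can do when we push it; the second asks what the model would do on its own.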
In my opinion, if you want to craft a good governance scheme around model evaluations, you're going to need both capabilities and alignment evaluations. For example, a very simplified scheme here could be something like the following (sketched in code after the list):
Do a bunch of capabilities evaluations for various risks:
If we believe the scaling laws for our capabilities evaluations are such that the next model to be trained won't be capable of causing a catastrophe even if it were trying to, then it's fine to train.
If we believe that the next model might be capable of causing a catastrophe if it were trying to, then do a bunch of alignment evals for whether the model would try to cause a catastrophe:
If we believe that the scaling laws for our alignment evaluations are such that we're confident that the next model will be aligned and won't try to cause a catastrophe, then it's fine to train.
Otherwise don't train any larger models.
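To show how the two kinds of evaluations slot together, here is a minimal sketch of the decision procedure in the list above. The two callables and their return conventions are hypothetical stand-ins, and the sketch is only meant to mirror the list, not to be a concrete policy proposal.

```python
# Minimal sketch of the simplified scheme above; the callables and their
# return conventions are hypothetical, not a concrete proposal.

def should_train_next_model(next_model_might_be_capable_of_catastrophe,
                            confident_next_model_wont_try_catastrophe):
    """Decide whether it's fine to train the next (larger) model.

    next_model_might_be_capable_of_catastrophe: runs capability evaluations
    and extrapolates their scaling laws to the next model.
    confident_next_model_wont_try_catastrophe: runs alignment evaluations
    and extrapolates their scaling laws to the next model.
    """
    # Step 1: capabilities. If the next model couldn't cause a catastrophe
    # even if it were trying to, it's fine to train.
    if not next_model_might_be_capable_of_catastrophe():
        return True
    # Step 2: the next model might be capable, so consult the alignment
    # evaluations: only train if we're confident it won't try.
    if confident_next_model_wont_try_catastrophe():
        return True
    # Otherwise, don't train any larger models.
    return False
```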
That's not to say that the above scheme is the best one (or even a good one) - I only provide it as an example of what the interplay between capabilities evaluations and alignment evaluations might look like.
Now we'll look at different evaluations and see:
What assumptions are necessary for them to be trustworthy.
How they can help us in evaluating capabilities and/or alignment.
1. Behavioral Non-Fine-Tuning Evaluations
Key assumption: no evaluation gaming.
We'll start with the most straightforward type of evaluation, behavioral non-fine-tuning evaluations - that is, just directly evaluating the model on some specific dataset, without any task-specific fine-tuning. Most current evaluations fall into this category...
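To make that concrete, here is a minimal sketch of a behavioral non-fine-tuning evaluation; the generate function and the dataset's graders are hypothetical placeholders. The defining property is that the model is queried as-is, with no task-specific fine-tuning beforehand.

```python
# Illustrative sketch: `generate` and the graders are hypothetical placeholders.
# The defining property is that the model is used as-is, with no task-specific
# fine-tuning before evaluation.

def behavioral_eval_without_finetuning(generate, dataset):
    """Directly evaluate the unmodified model on a fixed dataset.

    dataset is a list of (prompt, grader) pairs, where each grader maps the
    model's output to a score in [0, 1].
    """
    scores = [grader(generate(prompt)) for prompt, grader in dataset]
    return sum(scores) / len(scores)
```

Nothing in a loop like this forces the model to reveal its capabilities or its propensities; a model that recognized it was being evaluated could simply choose to behave differently, which is why the key assumption for this category is that there is no evaluation gaming.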