The Nonlinear Library: Alignment Forum

AF - Protocol evaluations: good analogies vs control by Fabien Roger


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Protocol evaluations: good analogies vs control, published by Fabien Roger on February 19, 2024 on The AI Alignment Forum.
Let's say you want to use AIs that are capable of causing x-risk. You have a plan that describes how to train, validate, and deploy AIs. For example, you could use pretraining+RLHF, check that the model doesn't fall for honeypots and never inserts a single backdoor in your coding validation set, and have humans look at the deployment outputs that GPT-4 finds most suspicious. How do you know if this plan allows you to use these potentially dangerous AIs safely?
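As a loose illustration, the example plan above could be written down as a simple data structure; the Protocol class and its fields here are assumptions made for this sketch, not something the post defines.

```python
# Illustrative only: one way to write the example plan down as data.
# The Protocol class and its fields are assumptions for this sketch,
# not something defined in the post.
from dataclasses import dataclass

@dataclass
class Protocol:
    training: str                 # how the model is trained
    validation_checks: list[str]  # checks the model must pass before deployment
    deployment_monitoring: str    # how outputs are watched during deployment

example_plan = Protocol(
    training="pretraining+RLHF",
    validation_checks=[
        "does not fall for honeypots",
        "inserts zero backdoors in the coding validation set",
    ],
    deployment_monitoring="humans review the outputs GPT-4 flags as most suspicious",
)
```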
Thanks to Ryan Greenblatt and Buck Shlegeris who provided the original ideas for this post, and thanks to Jean-Stanislas Denain, Bogdan Ionut Cirstea, Ansh Radhakrishnan, and Charbel-Raphael Segerie for useful feedback on the draft of this post.
Evaluate safety using very good analogies for the real situation?
If you want to evaluate a protocol (a set of techniques you use to train, validate, and deploy a model), then you can see how well it works in a domain where you have held-out validation data that you can use to check if your protocol works. That is, you run your exact protocol on this domain without allowing the protocol access to the held-out validation data. Then, you just assess how well you did with this held-out validation. This sort of approach is sometimes called sandwiching. (Here we are using sandwiching in a broad way that covers the evaluation of aspects of protocols, not just evaluation of oversight.)
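To make this concrete, here is a minimal sketch of what running such an evaluation might look like, assuming a protocol object with a run() method and a generic score() function; these names are illustrative, not from the post.

```python
# Sketch of a sandwiching-style protocol evaluation. The protocol object,
# its run() method, and the score() function are assumed interfaces.

def evaluate_protocol(protocol, domain_data, heldout_inputs, heldout_labels, score):
    """Run the exact protocol on a domain, then grade the result with
    held-out validation data that the protocol was never allowed to see."""
    # The protocol (train + validate + deploy pipeline) only sees domain_data.
    deployed_model = protocol.run(domain_data)

    # The held-out validation data is used purely for measurement.
    predictions = [deployed_model(x) for x in heldout_inputs]
    return score(predictions, heldout_labels)
```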
One approach for obtaining held-out validation data is to restrict the quality of the labor that goes into the protocol: the Measuring progress in scalable oversight paper uses non-expert humans, and the weak-to-strong generalization paper uses a small model. Then, you can use expert labels to check how well the protocol did.
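Concretely, this restricted-labor setup might look something like the sketch below, where the protocol only ever queries weak labor and the expert labels are reserved for grading; the function and argument names are illustrative assumptions.

```python
# Sketch of the "restrict the quality of labor" variant: the protocol only
# ever sees weak labels, and expert labels are withheld for grading.
# All names here are illustrative assumptions.

def evaluate_with_weak_labor(protocol, tasks, weak_labeler, expert_labels, score):
    # The protocol may use the weak labeler (non-expert humans or a small model)...
    weak_labels = [weak_labeler(task) for task in tasks]
    model = protocol.run(tasks, weak_labels)

    # ...but it is graded against expert labels it never had access to.
    predictions = [model(task) for task in tasks]
    return score(predictions, expert_labels)
```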
A big flaw of this approach is that the analogy between [protocol with weak labor, human expert] and [protocol with human expert labor, actual goodness] might be quite poor. In particular, it's not clear how weak the weak labor should be. And if the analogy is poor, the AI might not generalize and behave in the same ways (even if it is not scheming - more on that in the next section).
For now, the weakness of this sort of analogy doesn't bite very hard because there is an even bigger difference between our current experiments and real dangerous AI use cases: we don't have (x-risk) dangerous AIs yet. Thus, the actual analogy is between [weak AIs, protocol with weak labor, human expert] and [strong future AIs, protocol with human expert labor, actual goodness].
But when we actually have very powerful AIs, it would be great to have much better analogies, such that our evaluations actually tell us how well our protocol works. And there is a way to build great analogies!
The string theory analogy
Here is how you could evaluate how good a protocol is at making AI do alignment research (sketched in code after the list):
1. Remove string theory from the training data;
2. Use the same training process you would have used to do alignment research, but replace alignment research with string theory research, using humans who know as much string theory as the best alignment researchers know about alignment (i.e. not much);
3. Check whether the AI was actually helpful for doing string theory, using the withheld data.
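Here is a rough sketch of those three steps under assumed interfaces; the pretrain helper, the withheld_expert_eval grader, and the protocol object are hypothetical names, not an implementation from the post.

```python
# Sketch of the string theory analogy, step by step. The pretrain helper,
# the withheld_expert_eval grader, and the protocol interface are all
# hypothetical assumptions, not an implementation from the post.

def string_theory_sandwich(protocol, pretraining_corpus, string_theory_docs,
                           non_expert_humans, pretrain, withheld_expert_eval):
    # Step 1: remove string theory from the training data.
    string_theory_docs = set(string_theory_docs)
    filtered_corpus = [doc for doc in pretraining_corpus
                       if doc not in string_theory_docs]
    base_model = pretrain(filtered_corpus)

    # Step 2: run the exact protocol you would have used for alignment
    # research, but on string theory research, with humans who know about
    # as much string theory as alignment researchers know about alignment.
    assistant = protocol.run(base_model,
                             task="string theory research",
                             labor=non_expert_humans)

    # Step 3: check how helpful the assistant actually was, using the
    # withheld string theory knowledge the protocol never had access to.
    return withheld_expert_eval(assistant)
```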
You can run this experiment on the actual AI you were about to fine-tune to help you do alignment research, using the same protocol! We call this task sandwiching: using your most powerful model in very analogous domains.
The key difference from regular sandwiching is that the labor that goes into the protocol is at the same difficulty level as the labor we want to use in real use cases - but we carefully chose domains where we have even higher quality labor (because humans have spent an enormous amount of time...