The Nonlinear Library

LW - Ban development of unpredictable powerful models? by TurnTrout


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ban development of unpredictable powerful models?, published by TurnTrout on June 20, 2023 on LessWrong.
I think that we should pause capabilities progress indefinitely, at least for the largest and most powerful models. But when should we unpause? How can we know that a model is safe to develop or deploy?
I tentatively propose an answer. In particular, I propose a strict sufficient condition which is well-defined, which seems to rule out lots of accident risk, and which seems to partially align industry incentives with real alignment progress. This idea is not ready to be implemented, but I am currently excited about it and suspect it's crisp enough to actually implement.
When should we allow developers to deploy an already-trained language model on some task, like code completion? (I'll get to training risks later.)
Suppose I claim to understand how the model works. I say "I know what goals my model is pursuing. I know it's safe." To test my claims, you give me some random prompt (like "In the year 2042, humans finally"), and then (without using other AIs), I tell you "the most likely token is 'unlocked' with probability .04, the second-most likely is 'achieved' with probability .015, and...", and I'm basically right. That happens over hundreds of diverse validation prompts.
This would be really impressive and is great evidence that I really know what I'm doing.
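(For readers who want the mechanics: below is a minimal sketch of how a verifier could read off a model's actual next-token probabilities to check claimed numbers like those above. The model name "gpt2", the prompt, and the top-5 cutoff are illustrative assumptions, not anything specified in the post.)

# Minimal sketch (illustrative, not from the post): read off a model's actual
# next-token probabilities so a developer's predictions can be checked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "In the year 2042, humans finally"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)   # next-token distribution

# Report the top few candidate tokens, as in the example above.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}: {p.item():.4f}")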
Proposed sufficient condition for deploying powerful LMs: The developers have to predict the next-token probabilities on a range of government-provided validation prompts, without running the model itself. To do so, the developers are not allowed to use helper AIs whose outputs the developers can't predict by this criterion. Perfect prediction is not required. Instead, there is a fixed log-prob misprediction tolerance, averaged across the validation prompts.
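The post leaves the exact scoring rule open; one natural reading, sketched below under that assumption, scores each prompt by the KL divergence from the model's actual next-token distribution to the developer's predicted distribution, averages over the validation prompts, and compares against the fixed tolerance. The function names are my own, hypothetical ones.

# Hedged sketch of the pass/fail check, assuming "log-prob misprediction"
# means KL(actual || predicted) per prompt, averaged across prompts.
import numpy as np

def log_prob_misprediction(actual: np.ndarray, predicted: np.ndarray) -> float:
    # KL divergence over the vocabulary for one prompt; eps avoids log(0).
    eps = 1e-12
    return float(np.sum(actual * (np.log(actual + eps) - np.log(predicted + eps))))

def within_tolerance(actuals, predictions, tolerance: float) -> bool:
    # Average the per-prompt misprediction and compare to the fixed tolerance.
    scores = [log_prob_misprediction(a, p) for a, p in zip(actuals, predictions)]
    return float(np.mean(scores)) <= tolerance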
Benefits
Developers probably have to truly understand the model in order to predict it so finely. This correlates with high levels of alignment expertise, on my view. If we could predict GPT-4 generalization quite finely, down to the next-token probabilities, we might in fact be ready to use it to help understand GPT-5.
Incentivizes models which are more predictable. Currently we aren't directly taxing unpredictability. In this regime, an additional increment of unpredictability must be worth the additional difficulty with approval.
Robust to "the model was only slightly finetuned, fast-track our application please." If the model was truly only changed in a small set of ways which the developers understand, the model should still be predictable on the validation prompts.
Somewhat agnostic to theories of AI risk. We aren't making commitments about what evals will tend to uncover what abilities, or how long alignment research will take. The deadline is dynamic, and might even adapt to new AI paradigms (predictability seems general).
Partially incentivizes labs to do alignment research for us. Under this requirement, the profit-seeking move is to get better at predicting (and, perhaps, interpreting) model behaviors.
Drawbacks
There are several drawbacks. Most notably, this test seems extremely strict, perhaps beyond even the strict standards we demand of those looking to deploy potentially world-changing models. I'll discuss a few drawbacks in the next section.
Anticipated questions
If we pass this, no one will be able to train new frontier models for a long time.
Good.
But maybe "a long time" is too long. It's not clear that this criterion can be passed, even given a deep qualitative understanding of the model.
I share this concern. That's one reason I'm not lobbying to implement this as-is. Even solu-2l (6.3M params, 2-layer) is probably out of reach absent serious effort and a solution to superposition.
Maybe there are useful relaxations which are both more attai...