The Nonlinear Library: Alignment Forum

AF - AI Alignment Metastrategy by Vanessa Kosoy


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Alignment Metastrategy, published by Vanessa Kosoy on December 31, 2023 on The AI Alignment Forum.
I call "alignment strategy" the high-level approach to solving the technical problem[1]. For example, value learning is one strategy, while delegating alignment research to AI is another. I call "alignment metastrategy" the high-level approach to converging on solving the technical problem in a manner which is timely and effective. (Examples will follow.)
In a previous article, I summarized my criticism of prosaic alignment. However, my analysis of the associated metastrategy was too sloppy. I will attempt to somewhat remedy that here, and also briefly discuss other metastrategies, to serve as points of contrast and comparison.
Conservative Metastrategy
The conservative metastrategy consists of the following algorithm (a schematic code sketch follows the list):
1. As much as possible, stop all work on AI capability outside of this process.
2. Develop the mathematical theory of intelligent agents to a level where we can propose adequate alignment protocols with high confidence. Ideally, the theoretical problems should be solved in such an order that results with direct capability applications emerge as late as possible.
3. Design and implement empirical tests of the theory that incur minimal risk in worlds in which the theory contains errors or the assumptions of the theory are violated in practice.
4. If the tests show problems, go back to step 2.
5. Proceed with incrementally more ambitious tests in the same manner, until you're ready to deploy an AI defense system.
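To make the step order and retry loop explicit, here is a minimal, purely illustrative control-flow sketch in Python. Every function name is a hypothetical placeholder (stubbed with toy lambdas so the sketch runs); nothing here corresponds to a real system or API, and the sketch only mirrors the loop structure of the list above.

```python
# Illustrative sketch of the conservative metastrategy's control flow.
# Step functions are hypothetical placeholders passed in as parameters.

def conservative_metastrategy(develop_theory, run_test, deploy_defense, max_ambition=3):
    """Steps 2-5; step 1 (halting outside capability work) is an external precondition."""
    theory = develop_theory(feedback=None)               # step 2: theory of intelligent agents
    ambition = 1
    while ambition <= max_ambition:                      # step 5: incrementally more ambitious tests
        problems = run_test(theory, ambition)            # step 3: minimal-risk empirical test
        if problems:
            theory = develop_theory(feedback=problems)   # step 4: problems found, back to step 2
        else:
            ambition += 1
    deploy_defense(theory)                               # endpoint: deploy an AI defense system

# Toy stand-ins, only to make the control flow executable:
conservative_metastrategy(
    develop_theory=lambda feedback: {"revisions": 0 if feedback is None else 1},
    run_test=lambda theory, ambition: [],                # empty list means "no problems found"
    deploy_defense=lambda theory: print("defense system deployed"),
)
```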
This is my own favorite metastrategy. The main way it can fail is if the unconservative research we failed to stop creates unaligned transformative AI (TAI) before we can deploy an AI defense system (currently, we have a long way to go to complete step 2).
I think that it's pretty clear that a competent civilization would follow this path, since it seems like the only one which leads to a good long-term outcome without taking unnecessary risks[2]. Of course, in itself that is an insufficient argument to prove that, in our actual civilization, the conservative metastrategy is the best for those concerned with AI risk. But, it is suggestive.
Beyond that, I won't lay out the case for the conservative metastrategy here. The interested reader can turn to 1 2 3 4 5.
Incrementalist Metastrategy
The incrementalist metastrategy consists of the following algorithm (again with a schematic sketch after the list):
1. Find an advance in AI capability (by any means, including trial and error).
2. Find a way to align the new AI design (prioritizing solutions that you expect to scale further).
3. Validate alignment using a combination of empirical tests and interpretability tools.
4. If validation fails, go back to step 2.
5. If possible, deploy an AI defense system using the current level of capabilities.
6. Go to step 1.
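As before, a minimal control-flow sketch in Python, with all functions as hypothetical placeholders (toy stubs included so it runs); it only illustrates the outer capability loop and the inner alignment-validation retry loop described above.

```python
# Illustrative sketch of the incrementalist metastrategy's control flow.
# All step functions are hypothetical placeholders passed in as parameters.

def incrementalist_metastrategy(find_advance, align, validate, can_defend, deploy_defense, iterations=2):
    capability = None
    for _ in range(iterations):                        # the real loop is open-ended ("go to step 1")
        capability = find_advance(capability)          # step 1: find a capability advance
        alignment = align(capability)                  # step 2: align the new design (prefer scalable solutions)
        while not validate(alignment, capability):     # step 3: empirical tests + interpretability
            alignment = align(capability)              # step 4: validation failed, back to step 2
        if can_defend(capability):
            deploy_defense(capability)                 # step 5: deploy defense at current capability level

# Toy stand-ins, only to make the control flow executable:
incrementalist_metastrategy(
    find_advance=lambda prev: (prev or 0) + 1,
    align=lambda cap: {"capability": cap},
    validate=lambda alignment, cap: True,
    can_defend=lambda cap: cap >= 2,
    deploy_defense=lambda cap: print(f"defense deployed at capability level {cap}"),
)
```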
This is (more or less) the metastrategy favored by adherents of prosaic alignment. In particular, this is what the relatively safety-conscious actors involved with leading AI labs present as their plan.
There are three main, mutually reinforcing problems with putting our hopes in this metastrategy, which I discuss below. Each problem has two aspects: the "design" aspect, which is what would happen if the best version of the incrementalist metastrategy were implemented, and the "implementation" aspect, which is what happens in AI labs in practice (even when they claim to follow the incrementalist metastrategy).
Information Security
Design
If a new AI capability is found in step 1, and the knowledge is allowed to propagate, then irresponsible actors will continue to compound it with additional advances before the alignment problem is solved on the new level. Ideally, either the new capability should remain secret at least until the entire iteration is over, or government policy should prevent any actor from subverting the metastrategy, or some sufficient com...