The Nonlinear Library: Alignment Forum

AF - We have promising alignment plans with low taxes by Seth Herd


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We have promising alignment plans with low taxes, published by Seth Herd on November 10, 2023 on The AI Alignment Forum.
Epistemic status: I'm sure these plans have advantages relative to other plans. I'm not sure they're adequate to actually work, but I think they might be.
With good enough alignment plans, we might not need coordination to survive. If alignment taxes are low enough, we might expect most people developing AGI to adopt them voluntarily. There are two alignment plans that seem very promising to me, based on several factors, including ease of implementation and applicability to fairly likely default paths to AGI. Neither has received much attention. I can't find any commentary arguing that they wouldn't work, so I'm hoping to get them more attention so they can be considered carefully and either embraced or rejected.
Even if these plans[1] are as promising as I think now, I'd still give p(doom) in the vague 50% range. There is plenty that could go wrong.[2]
There's a peculiar problem with having promising but untested alignment plans: they're an excuse for capabilities to progress at full speed ahead. I feel a little hesitant to publish this piece for that reason, and you might feel some hesitation about adopting even this much optimism for similar reasons. I address this problem at the end.
The plans
Two alignment plans stand out among the many I've found. These seem more specific and more practical than others. They are also relatively simple and obvious plans for the types of AGI designs they apply to. They have received very little attention since being proposed recently. I think they deserve more attention.
The first is Steve Byrnes' Plan for mediocre alignment of brain-like [model-based RL] AGI. In this approach, we evoke a set of representations in a learning subsystem, and set the weights from there to the steering or critic subsystems. For example, we ask the agent to "think about human flourishing", and then freeze the system and set high weights between the active units in the learning system/world model and the steering system/critic units. The system now ascribes high value to the distributed concept of human flourishing (at least as it understands it). Thus, the agent's knowledge is used to define a goal we like.
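To make the mechanism concrete, here is a toy sketch in PyTorch. This is only an illustration under strong simplifying assumptions, not Byrnes' implementation: the "learning subsystem" is a tiny MLP, the "steering/critic subsystem" is a linear value head, and names like world_model and set_value_on_concept are hypothetical.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two subsystems. In a real brain-like model-based RL
# agent these would be far larger and structured very differently.
d_in, d_latent = 32, 64
world_model = nn.Sequential(nn.Linear(d_in, d_latent), nn.Tanh())  # "learning subsystem"
critic = nn.Linear(d_latent, 1, bias=False)                        # "steering/critic subsystem"

def set_value_on_concept(prompt_embedding, weight_scale=5.0):
    """Evoke a concept in the (frozen) world model, then wire the critic so
    latent states resembling that concept receive high predicted value."""
    with torch.no_grad():
        latent = world_model(prompt_embedding)           # activations evoked by the concept
        active = latent.abs() > latent.abs().mean()      # which units count as "active"
        new_w = torch.where(active, weight_scale * torch.sign(latent),
                            torch.zeros_like(latent))
        critic.weight.copy_(new_w)                       # broadcasts to the [1, d_latent] weight

# Usage: a random vector stands in for an embedding of the instruction
# "think about human flourishing"; how the concept is actually evoked is
# the substantive part of the plan, not shown here.
concept_embedding = torch.randn(d_in)
set_value_on_concept(concept_embedding)
```

The point of the sketch is the key move: the critic's weights are set directly from the activations the concept evokes, rather than being learned from a reward signal.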
This plan applies to all RL systems with a critic subsystem, which includes most powerful RL systems.[3] RL agents (including loosely brain-like systems of deep networks) seem like one very plausible route to AGI. I personally give them high odds of achieving AGI if language model cognitive architectures (LMCAs) don't achieve it first.
The second promising plan might be called natural language alignment, and it applies to language model cognitive architectures and other language model agents. The most complete writeup I'm aware of is mine. This plan similarly uses the agent's knowledge to define goals we like. Since that sort of agent's knowledge is expressed in language, this takes the form of stating goals in natural language and constructing the agent so that its system of self-prompting results in actions that pursue those goals. Internal and external review processes can improve the system's ability to effectively pursue both practical and alignment goals.
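As a rough sketch of what such an agent loop could look like (an illustration, not the writeup's implementation; llm and execute are hypothetical stand-ins for a language model call and an action executor):

```python
# Sketch of a language model cognitive architecture whose top-level goal is a
# natural-language string and whose self-prompting loop includes an internal
# alignment review before any action is executed.

ALIGNMENT_GOAL = "Act only in ways consistent with broad human flourishing."

def run_agent(llm, execute, task, max_steps=10):
    history = []
    for _ in range(max_steps):
        # The agent's own knowledge (in the LLM) interprets the goal statement
        # and proposes the next action toward the practical task.
        proposal = llm(
            f"Goal: {ALIGNMENT_GOAL}\nTask: {task}\n"
            f"History: {history}\nPropose the single next action."
        )
        # Internal review: a second pass checks the proposal against the
        # natural-language goal before anything is executed.
        verdict = llm(
            f"Goal: {ALIGNMENT_GOAL}\nProposed action: {proposal}\n"
            "Answer APPROVE or REVISE, with a one-sentence reason."
        )
        if verdict.startswith("APPROVE"):
            history.append(execute(proposal))
        else:
            history.append(f"revised: {verdict}")
    return history
```

The design choice of interest is that the alignment goal is an ordinary natural-language string interpreted by the same model that does the planning, and the review step is just another prompt in the self-prompting loop.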
John Wentworth's plan How To Go From Interpretability To Alignment: Just Retarget The Search is similar. It applies to a third type of AGI, a mesa-optimizer that emerges through training. It proposes using interpretability methods to identify the representations of goals in that mesa-optimizer; identifying representations of what we want the agent to do; and pointing the former at the latter.
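Schematically, and assuming the hard interpretability steps have already been done, the retargeting itself could be as simple as an activation patch. In the sketch below, target_module and desired_repr are hypothetical placeholders for what interpretability work would have to supply: the module whose output encodes the search's target, and the model's representation of the outcome we want.

```python
import torch
import torch.nn as nn

def retarget_search(target_module: nn.Module, desired_repr: torch.Tensor):
    """Patch the located goal representation so the emergent search process
    pursues the target we chose instead of the one it learned."""
    def patch(module, inputs, output):
        # Replace the search-target activation with our chosen representation.
        return desired_repr.expand_as(output)
    return target_module.register_forward_hook(patch)

# Usage (hypothetical module path): the hook stays active until removed.
# handle = retarget_search(model.transformer.h[12].mlp, desired_repr)
# ... run the model ...
# handle.remove()
```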
This plan seems more technically challenging, and I personally don't think an emergent mesa-optimizer in a predictive foundation model is a likely route to AGI. But this plan shares...