Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My thoughts on OpenAI's alignment plan, published by Akash on December 30, 2022 on LessWrong.
Epistemic Status: This is my first attempt at writing up my thoughts on an alignment plan. I spent about a week on it.
I’m grateful to Olivia Jimenez, Thomas Larsen, and Nicholas Dupuis for feedback.
A few months ago, OpenAI released its plan for alignment. More recently, Jan Leike (one of the authors of the original post) released a blog post about the plan, and Eliezer & Nate encouraged readers to write up their thoughts.
In this post, I cover some thoughts I have about the OpenAI plan. This is a long post, and I’ve divided it into three sections. Each successive section is more specific and detailed. If you only have ~5 minutes, I suggest reading section 1 and skimming section 2.
The three sections:
An overview of the plan and some of my high-level takes
Some things I like about the plan, some concerns, and some open questions
Specific responses to claims in OpenAI’s post and Jan’s post
Section 1: High-level takes
Summary of OpenAI’s Alignment Plan
As I understand it, OpenAI’s plan involves using reinforcement learning from human feedback (RLHF) and recursive reward modeling (RRM) to build AI systems that are better than humans at alignment research. OpenAI’s plan is not aiming for a full solution to alignment (one that scales indefinitely or that could work on a superintelligent system). Rather, the plan is intended to (a) build systems that are better at alignment research than humans (AI assistants), (b) use these AI assistants to accelerate alignment research, and (c) use these systems to build/align more powerful AI assistants.
Six things that need to happen for the plan to work
For the plan to work, I think it needs to get through the following 6 steps:
1. OpenAI builds LLMs that can help us with alignment research.
2. OpenAI uses those models primarily to help with alignment research, and slows down or stops capabilities research when necessary.
3. OpenAI has a way to evaluate the alignment strategies proposed by the LLM.
4. OpenAI has a way to determine when it’s OK to scale up to more powerful AI assistants (and has policies in place that prevent people from scaling up before it’s OK).
5. Once OpenAI has a highly powerful and aligned system, they do something with it that gets us out of the acute risk period.
6. Once we are out of the acute risk period, OpenAI has a plan for how to use transformative AI in ways that allow humanity to “fulfill its potential” or “achieve human flourishing” (and they have a plan to figure out what we should even be aiming for).
If any of these steps goes wrong, I expect the plan will fail to avoid an existential catastrophe. Note that steps 1-5 must be completed before another actor builds an unaligned AGI.
Here’s a table that lists each step and the extent to which I think it’s covered by the two posts about the OpenAI plan:
Step: Build an AI assistant
Extent discussed in the plan: Adequate; OpenAI acknowledges that this is their goal, mentions RLHF and RRM as techniques that could help us achieve this goal, and mentions reasonable limitations.

Step: Use the assistant for alignment research (and slow down capabilities)
Extent discussed in the plan: Inadequate; OpenAI mentions that they will use the assistant for alignment research, but they don’t describe how they will slow down capabilities research if necessary or how they plan to shift the current capabilities-alignment balance.

Step: Evaluate the alignment strategies proposed by the assistant
Extent discussed in the plan: Unclear; OpenAI acknowledges that they will need to evaluate strategies from the AI assistant, though the particular metrics they mention seem unlikely to detect alignment concerns that may only come up at high capability levels (e.g., deception, situational awareness).

Step: Figure out when it’s OK to scale up to more ...