Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My thoughts on OpenAI's alignment plan, published by Akash on December 30, 2022 on LessWrong.
Epistemic Status: This is my first attempt at writing up my thoughts on an alignment plan. I spent about a week on it.
I’m grateful to Olivia Jimenez, Thomas Larsen, and Nicholas Dupuis for feedback.
A few months ago, OpenAI released its plan for alignment. More recently, Jan Leike (one of the authors of the original post) released a blog post about the plan, and Eliezer & Nate encouraged readers to write up their thoughts.
In this post, I cover some thoughts I have about the OpenAI plan. This is a long post, and I’ve divided it into three sections. Each successive section is more specific and detailed. If you only have ~5 minutes, I suggest reading section 1 and skimming section 2.
The three sections:
An overview of the plan and some of my high-level takes
Some things I like about the plan, some concerns, and some open questions
Specific responses to claims in OpenAI’s post and Jan’s post
Section 1: High-level takes
Summary of OpenAI’s Alignment Plan
As I understand it, OpenAI’s plan involves using reinforcement learning from human feedback (RLHF) and recursive reward modeling (RRM) to build AI systems that are better than humans at alignment research. OpenAI’s plan is not aiming for a full solution to alignment (one that scales indefinitely or that could work on a superintelligent system). Rather, the plan is intended to (a) build systems that are better at alignment research than humans (AI assistants), (b) use these AI assistants to accelerate alignment research, and (c) use these systems to build/align more powerful AI assistants.
Six things that need to happen for the plan to work
For the plan to work, I think it needs to get through the following 6 steps:
1. OpenAI builds LLMs that can help us with alignment research.
2. OpenAI uses those models primarily to help with alignment research, and slows down or stops capabilities research when necessary.
3. OpenAI has a way to evaluate the alignment strategies proposed by the LLM.
4. OpenAI has a way to determine when it’s OK to scale up to more powerful AI assistants (and has policies in place that prevent people from scaling up before it’s OK).
5. Once OpenAI has a highly powerful and aligned system, they do something with it that gets us out of the acute risk period.
6. Once we are out of the acute risk period, OpenAI has a plan for how to use transformative AI in ways that allow humanity to “fulfill its potential” or “achieve human flourishing” (and they have a plan to figure out what we should even be aiming for).
If any of these steps goes wrong, I expect the plan will fail to avoid an existential catastrophe. Note that steps 1-5 must be completed before another actor builds an unaligned AGI.
Here’s a table that lists each step and the extent to which I think it’s covered by the two posts about the OpenAI plan:
Step: Build an AI assistant
Extent discussed in the plan: Adequate; OpenAI acknowledges that this is their goal, mentions RLHF and RRM as techniques that could help us achieve this goal, and mentions reasonable limitations.

Step: Use the assistant for alignment research (and slow down capabilities)
Extent discussed in the plan: Inadequate; OpenAI mentions that they will use the assistant for alignment research, but they don’t describe how they will slow down capabilities research if necessary or how they plan to shift the current capabilities-alignment balance.

Step: Evaluate the alignment strategies proposed by the assistant
Extent discussed in the plan: Unclear; OpenAI acknowledges that they will need to evaluate strategies from the AI assistant, though the particular metrics they mention seem unlikely to detect alignment concerns that may only come up at high capability levels (e.g., deception, situational awareness).

Step: Figure out when it’s OK to scale up to more ...