Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Could We Automate AI Alignment Research?, published by Stephen McAleese on August 10, 2023 on The AI Alignment Forum.
Summary
Creating an AI that can do AI alignment research (an automated alignment researcher) is one part of OpenAI's three-part alignment plan and a core goal of their Superalignment team.
The automated alignment researcher will probably involve an advanced language model more capable than GPT-4, possibly one with human-level intelligence. Current alignment approaches such as recursive reward modeling could be used to align it.
A key problem with creating an automated alignment researcher is a chicken-and-egg problem: alignment is needed before the AI can be safely created, and the AI is needed to create the alignment solution. One solution is bootstrapping: start with an initial alignment solution and iteratively improve the AI's alignment and capabilities.
I argue that the plan only succeeds if aligning the automated alignment researcher is easier than aligning AGI.
Automating alignment research could be a form of differential technological development and could be a Pivotal Act.
Creating a human-level automated alignment researcher seems risky because it is a form of AGI and therefore has similar risks.
Related ideas include Cyborgism and creating whole-brain emulations of AI alignment researchers before creating AGI.
Key cruxes: the level of capability needed to solve alignment, and whether aligning an AI with the narrow goal of solving alignment is easier than aligning an AI with more open-ended goals.
Introduction
OpenAI's alignment plan has three pillars: improving methods for aligning models using human feedback (e.g. RLHF), training models to assist human evaluation (e.g. recursive reward modeling), and training AI systems to do alignment research.
I recently listened to the AXRP podcast episode on Jan Leike and the OpenAI Superalignment team, which inspired me to write this post. My goal is to explore and evaluate OpenAI's third approach to solving alignment: creating an AI model that can do AI alignment research, an automated AI alignment researcher, to solve the alignment problem. I'm interested in this approach because it seems to be the newest and most speculative part of their plan, and in this post I'll try to explain how it could work and the key challenges that would need to be solved to make it work.
At first, the idea may sound unreasonable because there's an obvious chicken-and-egg problem: what's the point of creating an automated alignment researcher to solve alignment if we need to solve alignment first to create the automated alignment researcher? It's easy to dismiss the whole proposal because of this problem, but I'll explain how there is a way around it. I'll also go through the risks associated with the plan.
Why would an automated alignment researcher be valuable?
Scalable AI alignment is a hard unsolved problem
In their alignment plan, OpenAI acknowledges that creating an indefinitely scalable solution to AI alignment is currently an unsolved problem and will probably be difficult to solve.
Reinforcement learning from human feedback (RLHF) is OpenAI's current alignment approach for recent systems such as ChatGPT and GPT-4, but it probably wouldn't scale to aligning superintelligence, so new ideas are needed. One major problem is that RLHF depends on human evaluators understanding and evaluating the outputs of AIs. But once AIs are superintelligent, human evaluators probably won't be able to understand and evaluate their outputs. RLHF also has other problems, such as reward hacking, and can incentivize deception. See this paper for a more detailed description of problems with RLHF.
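To make the human-evaluation bottleneck concrete, here is a minimal toy sketch in Python (not OpenAI's actual pipeline; the candidate responses, quality scores, and the `human_prefers` oracle are all made up for illustration). It follows the general RLHF pattern: a Bradley-Terry-style reward model is fit to simulated human pairwise preferences, and the "policy" step then simply picks the output the learned reward model rates highest.

```python
# Toy sketch of the RLHF pattern: human pairwise preferences train a reward
# model, and the policy is then optimized against that learned reward.
# The only training signal enters through `human_prefers`, the human judgment
# step that stops working once outputs exceed human understanding.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate model outputs and their hidden "true" quality,
# which in reality is only accessible through human judgments.
responses = ["A", "B", "C", "D", "E"]
true_quality = {"A": 0.1, "B": 0.9, "C": 0.4, "D": 0.7, "E": 0.2}

def human_prefers(i, j):
    """Simulated human evaluator: is response i better than response j?"""
    return true_quality[responses[i]] > true_quality[responses[j]]

# 1. Collect pairwise comparisons from the "human".
comparisons = []
for _ in range(300):
    i, j = rng.choice(len(responses), size=2, replace=False)
    comparisons.append((i, j) if human_prefers(i, j) else (j, i))  # (winner, loser)

# 2. Fit a Bradley-Terry reward model: one learned score per response.
scores = np.zeros(len(responses))
lr = 0.1
for _ in range(200):
    grad = np.zeros_like(scores)
    for winner, loser in comparisons:
        p_win = 1.0 / (1.0 + np.exp(scores[loser] - scores[winner]))
        grad[winner] += 1.0 - p_win   # push the winner's reward up
        grad[loser] -= 1.0 - p_win    # push the loser's reward down
    scores += lr * grad / len(comparisons)

# 3. "Policy optimization" step: choose the response the reward model rates highest.
best = responses[int(np.argmax(scores))]
print("Learned rewards:", dict(zip(responses, scores.round(2))))
print("Policy output:", best)  # "B", the response the simulated human prefers
```

The sketch works only because the simulated human can reliably tell which response is better; if the candidate outputs were beyond the evaluator's understanding, the comparisons would be noise and the learned reward would no longer track what we actually want.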
Recursive reward modeling (RRM) improves RLHF by using models to help humans evalu...