Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review), published by Shoshannah Tekofsky on January 28, 2023 on LessWrong.
Just like you can test your skill in experimental design by reviewing existing experiments, you can test your skill in alignment by reviewing existing alignment strategies. Conveniently, Rob Bensinger, on behalf of Nate Soares and Eliezer Yudkowsky, recently posted a challenge to AI Safety researchers to review the OpenAI alignment plan written by Jan Leike, John Schulman, and Jeffrey Wu. I figured this constituted a test that might net me feedback from both sides of the rationalist-empiricist aisle. Yet instead of finding ground-breaking arguments for or against using scalable oversight to do alignment research, it seems Leike already knows what might go wrong, and goes ahead anyway.
Thus my mind became split between evaluating the actual alignment plan and modeling the disagreement between prominent clusters of researchers. I wrote up the latter as an informal typology of AI Safety Researchers, and continued my technical review below. What follows is a short summary of the OpenAI alignment plan, my views on its main problems, and a final section with recommendations for red-lining.
The Plan
First, align AI with human feedback; then get AI to assist humans in giving feedback to AI; then get AI to assist humans in giving feedback to an AI that is generating solutions to the alignment problem. Except the steps are not sequential: they run in parallel. This is one form of Scalable Oversight.
Human feedback is Reinforcement Learning from Human Feedback (RLHF), the assisting AI is Iterated Distillation and Amplification (IDA) plus Recursive Reward Modeling (RRM), and the AI that generates solutions to the alignment problem is... still under construction.
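To make the recursive structure concrete, here is a minimal Python sketch of the three stages. Every name in it is a hypothetical stand-in of my own (nothing here comes from OpenAI's actual systems); it only mirrors the shape of the plan: direct human feedback, AI-assisted human feedback, and a feedback-trained model that proposes alignment research.

```python
import random

def human_rates(output: str) -> float:
    """Stage 1 (RLHF): a human scores a model output directly."""
    return random.random()  # placeholder for a real human judgment

def assistant_critique(output: str) -> str:
    """Stage 2 (IDA/RRM): an AI assistant surfaces issues in the output
    that an unaided human evaluator might miss."""
    return f"critique of: {output}"

def human_rates_with_assistance(output: str) -> float:
    """Stage 2, continued: the human rates the output together with the
    assistant's critique, i.e. AI-assisted human feedback."""
    return human_rates(output + "\n" + assistant_critique(output))

def alignment_researcher_model(problem: str) -> str:
    """Stage 3: the (still hypothetical) model that proposes alignment
    research, trained against the assisted feedback signal."""
    return f"proposed solution to: {problem}"

# The stages run in parallel rather than in sequence: assisted feedback
# is used both to improve the assistant and to train the researcher model.
proposal = alignment_researcher_model("scalable oversight failure modes")
print(proposal, human_rates_with_assistance(proposal))
```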
The target is a narrow AI that will make significant progress on the alignment problem. The MVP is a theorem prover. The full product is AGI utopia.
OpenAI explains its strategy succinctly and links to detailed background research. This is laudable, and hopefully other labs and organizations will follow suit. My understanding is also that if someone came along with a better plan, OpenAI would pivot in a heartbeat, which is even more laudable. The transparency, accountability, and flexibility they display set a strong example for other organizations working on AI.
But the show must go on (from their point of view anyway) and so they are going ahead and implementing the most promising strategy that currently exists. Even if there are problems.
And boy, are there problems.
The Problems
Jan Leike discusses almost all objections to the OpenAI alignment plan on his blog. Below, I will therefore highlight only the two most important problems with the plan, plus two additional concerns that I have not seen discussed so far.
Alignment research requires general intelligence - If the alignment-researcher AI has enough general intelligence to make breakthrough discoveries in alignment, then you can't safely create it without already having solved alignment. Yet Leike et al. hope that relatively narrow intelligence can already make significant progress on alignment. I think this is extremely unlikely once we reflect on what general intelligence truly is. Though my own thoughts on the nature of intelligence are not entirely coherent yet, I'd argue that having a strong concept of intelligence is key to accurately predicting the outcome of an alignment strategy.
Specifically, in this case my understanding is that general intelligence is the ability to perform a wider set of operations on a wider set of inputs (to achieve a desired set of observations on the world state). For example, I can do addition on 2 apples I see, 2 apples I think about, 2 boats I hear about, 2 functions ...
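To make that concrete, here is a toy Python sketch of one operation (addition) being applied across increasingly abstract input types. The dispatch rules and names are my own illustration of the generality point, not anything from the original post.

```python
def add(a, b):
    """A single 'addition' operation that works on a widening set of inputs."""
    # Numbers (apples seen, apples imagined, boats heard about) add directly.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b
    # Functions add pointwise: (f + g)(x) = f(x) + g(x).
    if callable(a) and callable(b):
        return lambda x: add(a(x), b(x))
    raise TypeError("no addition rule known for these inputs")

print(add(2, 2))                            # 4
f = add(lambda x: 2 * x, lambda x: x ** 2)  # adding two functions
print(f(3))                                 # (2*3) + (3**2) = 15
```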