The Nonlinear Library

AF - Why Not Just Outsource Alignment Research To An AI? by johnswentworth


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Not Just Outsource Alignment Research To An AI?, published by johnswentworth on March 9, 2023 on The AI Alignment Forum.
Warmup: The Expert
If you haven’t seen “The Expert” before, I recommend it as a warmup for this post:
The Client: “We need you to draw seven red lines, all strictly perpendicular. Some with green ink, some with transparent. Can you do that?”
(... a minute of The Expert trying to explain that, no, he cannot do that, nor can anyone else.)
The Client: “So in principle, this is possible.”
This. This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform. At best, The Expert is well-intentioned and knows what the user needs, ignores the incoherent parts of The Client’s babbling, and does the right thing. Or, they manage to add some silly but ultimately harmless bells and whistles to satisfy whatever dumb thing The Client is looking for.
At worst... well, there’s more than one failure mode which could qualify for the title of "worst". Maybe The Expert gives The Client something which looks right to The Client and successfully conceals all the problems with it; presumably that’s a lucrative strategy for Experts. Maybe the Double Illusion of Transparency kicks in: both parties think they’ve successfully communicated, but in fact neither has any idea what’s going on in the other’s head. Maybe a well-intentioned Expert decides to ignore The Client’s incoherent babbling and do the thing which seems most likely to be right, but gets The Client’s preferences wrong.
One way or another, The Client’s ignorance is a major bottleneck to cognitive outsourcing. In practice, I expect The Client’s ignorance to be the primary bottleneck to cognitive outsourcing.
The core reason why we cannot just outsource alignment research to an AI is because we would then be The Client, and probably a very ignorant one.
Application to Alignment Schemes
There’s a lot of different flavors of “have the AI solve alignment for us”. A sampling:
Just prompt a language model to generate alignment research
Do some fine-tuning/RLHF on the language model to make it generate alignment research
Let the language model talk to other instances of itself, and prompt or fine-tune them together so they generate alignment research jointly
Set up a language model to generate alignment proposals and another to poke holes in them, and fine-tune the pair via a human judging the “debate”
As we go down the list, the proposals get fancier and add more bells and whistles, trying to make the AI a better expert. Sadly, none of them at all address what I expect to be the actual main bottleneck: The Client (i.e. the human user or users) has no understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc.
What would this kind of error look like in practice?
Here’s one concrete example of the kind of failures I’d expect when a would-be outsourcer’s understanding falls short (from here):
Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned...
The Nonlinear Library, by The Nonlinear Fund
