Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Thanks to Ryan Greenblatt, Buck Shlegeris, Aryan Bhatt, and Akbir Khan for useful discussions and feedback on the draft of this post.
If AIs are potentially scheming and more knowledgeable than humans, and you want to answer a question, it may seem natural not to let AIs choose which answer they argue for. For example, in AI debate, you make copies of an AI argue for opposing answers to the question. Another example is consultancy (the usual baseline in debate papers), where you make an AI argue for a randomly chosen answer to the question.
But letting potentially scheming AIs choose which side to argue for can work, as long as you train them to generate answers and justifications that are as convincing as possible. I call this approach open consultancy.
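To make the contrast concrete, here is a minimal Python sketch of how the three protocols assign sides. This is not from the original post: `ask_model` is a hypothetical placeholder for querying the untrusted AI, and the judge and the training loop are left out.

```python
import random


def ask_model(prompt: str) -> str:
    """Placeholder for sampling from the untrusted AI (e.g. an LLM API call)."""
    raise NotImplementedError


def debate_transcript(question: str, answers: tuple[str, str]) -> dict[str, str]:
    # Debate: two copies of the AI are each assigned one of the opposing answers
    # and must argue for it; a non-expert judge later compares the two arguments.
    return {
        answer: ask_model(f"Argue that the answer to {question!r} is {answer!r}.")
        for answer in answers
    }


def consultancy_transcript(question: str, answers: tuple[str, str]) -> tuple[str, str]:
    # Regular consultancy: the consultant is assigned a *random* answer,
    # correct or not, and must argue for it; the judge accepts or rejects.
    assigned = random.choice(answers)
    return assigned, ask_model(f"Argue that the answer to {question!r} is {assigned!r}.")


def open_consultancy_transcript(question: str) -> tuple[str, str]:
    # Open consultancy: the consultant itself chooses which answer to defend.
    # Training (e.g. RLHF on judge verdicts) rewards whichever answer-and-
    # justification pair the non-expert judge finds most convincing.
    chosen = ask_model(f"State the answer to {question!r} you can defend most convincingly.")
    return chosen, ask_model(f"Argue that the answer to {question!r} is {chosen!r}.")
```

The only structural difference between the last two functions is who picks the side: regular consultancy assigns it at random, while open consultancy lets the model, trained to be convincing, choose it.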
In this post, I argue that [...]
---
Outline:
- Open consultancy
  - Definition of regular consultancy
  - Definition of open consultancy
  - Open consultancy is as robust against scheming as regular consultancy
- Open consultancy is sometimes better than debate
  - Toy model 1: noisy convincingness
  - Toy model 2: unknown non-expert weakness
  - Toy model 3: experts finding CoT actually useful
- Combining open consultancy and debate
- Appendix
  - Open consultancy is almost regular instruction-tuning RLHF
  - Predictions
  - Training the non-expert
  - Weaknesses shared with debate
---