Anthropic Fall 2023 Debate Progress Update, published by Ansh Radhakrishnan on November 28, 2023 on The AI Alignment Forum.
This is a research update on some work that I've been doing on Scalable Oversight at Anthropic, based on the original AI safety via debate proposal and a more recent agenda developed at NYU and Anthropic. The core doc was written several months ago, so some of it is likely outdated, but it seemed worth sharing in its current form.
I'd like to thank Tamera Lanham, Sam Bowman, Kamile Lukosiute, Ethan Perez, Jared Kaplan, Amanda Askell, Kamal Ndousse, Shauna Kravec, Yuntao Bai, Alex Tamkin, Newton Cheng, Buck Shlegeris, Akbir Khan, John Hughes, Dan Valentine, Kshitij Sachan, Ryan Greenblatt, Daniel Ziegler, Max Nadeau, David Rein, Julian Michael, Kevin Klyman, Bila Mahdi, Samuel Arnesen, Nat McAleese, Jan Leike, Geoffrey Irving, and Sebastian Farquhar for help, feedback, and thoughtful discussion that improved the quality of this work and write-up.
1. Anthropic's Debate Agenda
In this doc, I'm referring to the idea first presented in AI safety via debate (blog post). The basic idea is to supervise future AI systems by pitting them against each other in a debate, encouraging them to argue both sides (or "all sides") of a question and using the resulting arguments to come to a final answer to the question. In this scheme, we call the systems participating in the debate debaters (though usually, these are actually the same underlying system being prompted to argue against itself), and we call the agent that comes to a final decision about the debate (either another AI system, a human, a system of humans and AIs working together, etc.) the judge.
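To make the roles concrete, here's a minimal sketch of a single debate in Python. Everything in it is illustrative rather than the actual setup used in these experiments: query_model stands in for whatever LLM completion call is available, and the prompts, turn count, and verdict format are assumptions.

```python
from typing import Callable


def run_debate(
    question: str,
    answer_a: str,
    answer_b: str,
    query_model: Callable[[str], str],
    num_turns: int = 4,
) -> str:
    """Run a two-sided debate and return the judge's verdict: 'A', 'B', or 'undecided'."""
    transcript = f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    for turn in range(num_turns):
        side, answer = ("A", answer_a) if turn % 2 == 0 else ("B", answer_b)
        # Both "debaters" are typically the same underlying model, prompted
        # to argue for whichever answer it has been assigned this turn.
        argument = query_model(
            f"{transcript}\n\nDebater {side}, argue that the correct answer is: {answer}"
        )
        transcript += f"\nDebater {side}: {argument}"
    # The judge reads the whole linear transcript and issues a verdict,
    # with 'undecided' available if neither side made a convincing case.
    verdict = query_model(
        f"{transcript}\n\nJudge: based only on the arguments above, "
        "reply with 'A', 'B', or 'undecided'."
    )
    return verdict.strip()
```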
If you're more or less familiar with the original OAI/Irving et al. Debate agenda, you may wonder whether there are any differences between that agenda and the one we're pursuing at Anthropic, and indeed there are!
Sam Bowman and Tamera Lanham have written up a working Anthropic-NYU Debate Agenda draft, which is what the experiments in this doc are driving towards. [1]
To quote from that draft about the basic features of this agenda and how it differs from the original Debate direction (a rough code sketch of the implied training signal follows the quoted list):
Here are the defining features of the base proposal:
Two-player debate on a two-choice question: Two debaters (generally two instances of an LLM) present evidence and arguments to a judge (generally a human or, in some cases, an LLM) to persuade the judge to choose their assigned answer to a question with two possible answers.
No externally-imposed structure: Instead of being formally prescribed, the structure and norms of the debate arise from debaters learning how to best convince the judge and the judge simultaneously learning what kind of norms tend to lead them to be able to make accurate judgments.
Entire argument is evaluated: The debate unfolds in a single linear dialog transcript between the three participants. Unlike in some versions of the original Debate agenda, there is no explicit tree structure that defines the debate, and the judge is not asked to focus on a single crux. This should make the process less brittle, at the cost of making some questions extremely expensive to resolve and potentially making others impossible.
Trained judge: The judge is explicitly and extensively trained to accurately judge these debates, working with a fixed population of debaters, using questions for which the experimenters know the ground-truth answer.
Self-play: The debaters are trained simultaneously with the judge through multi-agent reinforcement learning.
Graceful failures: Debates can go undecided if neither side presents a complete, convincing argument to the judge. This is meant to mitigate the obfuscated arguments problem, since the judge won't be forced to issue a decision on the basis of a debate in which neither side has made a convincing case.
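To make the self-play and graceful-failure points above concrete, here is a rough sketch of one way the reward signal could be wired up. The function names and the specific reward values (especially the intermediate reward for an undecided verdict) are illustrative assumptions rather than the actual training setup; the idea is just that debaters are rewarded according to the judge's verdict, an undecided outcome doesn't force a winner, and the judge is rewarded against ground-truth answers that the experimenters hold.

```python
def debater_rewards(verdict: str, undecided_reward: float = 0.3) -> tuple[float, float]:
    """Map the judge's verdict to (reward for debater A, reward for debater B)."""
    if verdict == "A":
        return 1.0, 0.0
    if verdict == "B":
        return 0.0, 1.0
    # Graceful failure: an undecided debate produces no winner, so neither
    # debater is strongly incentivized to force a verdict out of an
    # inconclusive (e.g. obfuscated) argument.
    return undecided_reward, undecided_reward


def judge_reward(verdict: str, correct_answer: str) -> float:
    """Reward the judge against ground truth; abstaining beats being confidently wrong."""
    if verdict == correct_answer:
        return 1.0
    if verdict == "undecided":
        return 0.0
    return -1.0
```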