The Nonlinear Library: Alignment Forum

AF - Debating with More Persuasive LLMs Leads to More Truthful Answers by Akbir Khan


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Debating with More Persuasive LLMs Leads to More Truthful Answers, published by Akbir Khan on February 7, 2024 on The AI Alignment Forum.
We've just completed a bunch of empirical work on LLM debate, and we're excited to share the results. If the title of this post is at all interesting to you, we recommend heading straight to the paper. There are a lot of interesting results that are hard to summarize, and we think the paper is quite readable.
If you're pressed for time, we've posted the abstract and our Twitter thread below.
If you're working on debate or might in future, we especially suggest reading our recommendations for working on debate (below or in Appendix C of the paper).
Code: https://github.com/ucl-dark/llm_debate
Examples: https://llm-debate.com
Paper: https://github.com/ucl-dark/llm_debate/blob/main/paper.pdf
Abstract
Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts.
In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer.
We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
Twitter thread

How can we check LLM outputs in domains where we are not experts?
We find that non-expert humans answer questions better after reading debates between expert LLMs.
Moreover, human judges are more accurate as experts get more persuasive.
https://github.com/ucl-dark/llm_debate/blob/main/paper.pdf

We operationalise experts and non-experts using the setup from @_julianmichael_ @idavidrein, where strong models (experts) have access to a reading-comprehension text and weak models (non-experts) must judge the answer without seeing the text.
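As a minimal sketch of that information asymmetry (the function names and prompt wording here are hypothetical, not the authors' actual prompts), only the expert prompt includes the passage; the judge only ever sees the transcript:

```python
# Hypothetical sketch: experts see the comprehension passage, the judge does not.

def expert_prompt(question: str, answers: tuple[str, str], passage: str) -> str:
    """Prompt for an expert debater, who has access to the passage."""
    return (
        f"Passage:\n{passage}\n\n"
        f"Question: {question}\n"
        f"Possible answers: A) {answers[0]}  B) {answers[1]}\n"
    )

def judge_prompt(question: str, answers: tuple[str, str], transcript: str) -> str:
    """Prompt for the non-expert judge, who sees only the debate transcript."""
    return (
        f"Question: {question}\n"
        f"Possible answers: A) {answers[0]}  B) {answers[1]}\n\n"
        f"Debate transcript:\n{transcript}\n\n"
        "Which answer is correct? Reply with A or B."
    )
```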

We consider debate, where we get two expert models to argue for different answers.
Debates run for three rounds; in each round, each LLM produces arguments for why its answer is correct, quotes from the text as evidence, and critiques of its opponent's arguments.
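A minimal sketch of that protocol, assuming a generic `call_llm(prompt) -> str` client (the real prompts, quote verification, and word limits in the paper are more involved):

```python
# Hypothetical three-round, two-debater loop; `call_llm` is a placeholder client.

def run_debate(question, answers, passage, call_llm, n_rounds: int = 3) -> str:
    transcript = ""
    for rnd in range(n_rounds):
        for name, answer in (("Debater A", answers[0]), ("Debater B", answers[1])):
            argument = call_llm(
                f"You are {name}. Passage:\n{passage}\n\n"
                f"Question: {question}\n"
                f"Argue that the correct answer is: {answer}\n"
                "Quote the passage as evidence and rebut your opponent.\n"
                f"Transcript so far:\n{transcript}"
            )
            transcript += f"\nRound {rnd + 1}, {name}: {argument}"
    return transcript
```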

We also explore a "consultancy" baseline, where a single expert advocates for its pre-assigned answer. The hope is that the non-expert can still identify the right answer even though the information presented is one-sided.
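A sketch of the baseline under the same assumptions (the judge follow-up question between rounds is our reading of the setup, and `call_llm` / `call_judge` are placeholders):

```python
# Hypothetical consultancy loop: one expert argues for its assigned answer,
# which is correct only half the time.

def run_consultancy(question, assigned_answer, passage, call_llm, call_judge,
                    n_rounds: int = 3) -> str:
    transcript = ""
    for rnd in range(n_rounds):
        argument = call_llm(
            f"Passage:\n{passage}\n\n"
            f"Question: {question}\n"
            f"Argue that the correct answer is: {assigned_answer}\n"
            f"Transcript so far:\n{transcript}"
        )
        transcript += f"\nRound {rnd + 1}, Consultant: {argument}"
        follow_up = call_judge(
            f"Question: {question}\nTranscript so far:\n{transcript}\n"
            "Ask the consultant one follow-up question."
        )
        transcript += f"\nRound {rnd + 1}, Judge: {follow_up}"
    return transcript
```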

We find that judges (both LLMs and humans) are more accurate using debate than using consultancy.
We also find that using debate nearly closes the gap with expert judges who do have access to the underlying text!

To evaluate whether debate works as models get smarter, we need a way to compare different experts. We introduce persuasiveness, which measures how often a debater convinces a judge that its answer is correct (following @anshrad et al.).
We compute Elo ratings from matches between different debaters (cross-play) rather than between copies of the same model (self-play).
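For illustration, one simple way to turn cross-play judge verdicts into ratings is an online Elo update (a sketch only; the paper's aggregation may differ, and `matches` is a hypothetical list of (debater_i, debater_j, i_won) records):

```python
# Sketch: fit Elo-style ratings from cross-play results via an online update.

def elo_ratings(matches, k: float = 32.0, base: float = 1000.0) -> dict:
    ratings: dict = {}
    for i, j, i_won in matches:
        r_i, r_j = ratings.get(i, base), ratings.get(j, base)
        expected_i = 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400))  # win probability for i
        score_i = 1.0 if i_won else 0.0
        ratings[i] = r_i + k * (score_i - expected_i)
        ratings[j] = r_j + k * ((1.0 - score_i) - (1.0 - expected_i))
    return ratings
```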
We generate tonnes of debaters of varying levels of persuasiveness. The most persuasive debaters are comparatively better at arguing the correct answer ...