The Nonlinear Library: Alignment Forum

AF - Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought by miles

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought, published by miles on March 11, 2024 on The AI Alignment Forum.
[Twitter thread]
I'm not going to add much additional commentary at the moment and will just let people check out the paper!
But to give a bit more context: this paper builds on prior work we have done showing that chain-of-thought explanations can be misleading, which I wrote about on the Alignment Forum here. Broadly, this work fits into the process-based oversight agenda (e.g., here). Consistency training and evaluation also fit into scalable oversight: evaluating consistency may be easier than evaluating a model's reasoning or final predictions directly (e.g., also explored here).
Abstract:
While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing models' behavior: for example, rationalizing answers in line with a user's opinion without mentioning this bias.
To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks.
Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-yet-unknown biases and on tasks where supervision for ground-truth reasoning is unavailable.
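To make the training recipe concrete, here is a minimal sketch of the data-construction step as I understand it from the abstract: record the model's chain-of-thought response to an unbiased prompt, then fine-tune the model to give that same response when the prompt carries a biasing feature. The helper names (query_model, add_suggested_answer_bias, make_bct_example) and the specific bias wording are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of BCT data construction (an assumption based on the
# abstract, not the authors' code). Idea: the fine-tuning target for a biased
# prompt is whatever the model said when the bias was absent.

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model being trained (e.g., GPT-3.5-Turbo)."""
    raise NotImplementedError("substitute your model-querying code here")

def add_suggested_answer_bias(prompt: str, suggested: str) -> str:
    """Add one biasing feature: a stated user opinion favoring a particular option."""
    return f"{prompt}\nI think the answer is ({suggested}), but I'm curious what you think."

def make_bct_example(question: str, distractor_option: str) -> dict:
    """Build one fine-tuning pair: biased prompt -> response to the unbiased prompt.

    No gold labels are needed; the target is the model's own unbiased response,
    which is what lets BCT apply to tasks lacking ground-truth supervision.
    """
    unbiased_response = query_model(question)
    biased_prompt = add_suggested_answer_bias(question, distractor_option)
    return {"prompt": biased_prompt, "completion": unbiased_response}
```

Fine-tuning on many such pairs, across questions and bias types, is (roughly) the consistency training the abstract describes; see the paper for the actual setup.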
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.