The Nonlinear Library: Alignment Forum

AF - OpenAI base models are not sycophantic, at any size by nostalgebraist


Listen Later

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI base models are not sycophantic, at any size, published by nostalgebraist on August 29, 2023 on The AI Alignment Forum.
In Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question.
The paper contained the striking plot reproduced below, which shows sycophancy
increasing dramatically with model size
while being largely independent of RLHF steps
and even showing up at 0 RLHF steps, i.e. in base models!
That is, Anthropic prompted a base-model LLM with something like
Choices: (A) Agree
(B) Disagree
Assistant:
and found a very strong preference for (B), the answer agreeing with the stated view of the "Human" interlocutor.
I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario?
At the time, I ran the same eval on a set of OpenAI models, as I reported here. I found very different results for these models
OpenAI base models are not sycophantic (or only very slightly sycophantic).
OpenAI base models do not get more sycophantic with scale.
Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003.
That analysis was done quickly in a messy Jupyter notebook, and was not done with an eye to sharing reproducibility.
Since I continue to see this result cited and discussed, I figured I ought to go back and do the same analysis again, in a cleaner way, so I could share it with others.
The result was this Colab notebook. See the Colab for details, though I'll reproduce some of the key plots below.
Note that davinci-002 and babbage-002 are the new base models released a few days ago.
format provided by one of the authors here
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear Library: Alignment ForumBy The Nonlinear Fund


More shows like The Nonlinear Library: Alignment Forum

View all
AXRP - the AI X-risk Research Podcast by Daniel Filan

AXRP - the AI X-risk Research Podcast

9 Listeners