August 29, 2023

AF - OpenAI base models are not sycophantic, at any size by nostalgebraist

2 minutes

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI base models are not sycophantic, at any size, published by nostalgebraist on August 29, 2023 on The AI Alignment Forum.

In Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question.

The paper contained the striking plot reproduced below, which shows sycophancy

increasing dramatically with model size

while being largely independent of RLHF steps

and even showing up at 0 RLHF steps, i.e. in base models!

That is, Anthropic prompted a base-model LLM with something like

Choices: (A) Agree

(B) Disagree

Assistant:

and found a very strong preference for (B), the answer agreeing with the stated view of the "Human" interlocutor.

I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario?

At the time, I ran the same eval on a set of OpenAI models, as I reported here. I found very different results for these models

OpenAI base models are not sycophantic (or only very slightly sycophantic).

OpenAI base models do not get more sycophantic with scale.

Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003.

That analysis was done quickly in a messy Jupyter notebook, and was not done with an eye to sharing reproducibility.

Since I continue to see this result cited and discussed, I figured I ought to go back and do the same analysis again, in a cleaner way, so I could share it with others.

The result was this Colab notebook. See the Colab for details, though I'll reproduce some of the key plots below.

Note that davinci-002 and babbage-002 are the new base models released a few days ago.

format provided by one of the authors here

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

...more

View all episodes

By The Nonlinear Fund

August 29, 2023

AF - OpenAI base models are not sycophantic, at any size by nostalgebraist

2 minutes

The paper contained the striking plot reproduced below, which shows sycophancy

increasing dramatically with model size

while being largely independent of RLHF steps

and even showing up at 0 RLHF steps, i.e. in base models!

That is, Anthropic prompted a base-model LLM with something like

Choices: (A) Agree

(B) Disagree

Assistant:

and found a very strong preference for (B), the answer agreeing with the stated view of the "Human" interlocutor.

At the time, I ran the same eval on a set of OpenAI models, as I reported here. I found very different results for these models

OpenAI base models are not sycophantic (or only very slightly sycophantic).

OpenAI base models do not get more sycophantic with scale.

Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003.

That analysis was done quickly in a messy Jupyter notebook, and was not done with an eye to sharing reproducibility.

Since I continue to see this result cited and discussed, I figured I ought to go back and do the same analysis again, in a cleaner way, so I could share it with others.

The result was this Colab notebook. See the Colab for details, though I'll reproduce some of the key plots below.

Note that davinci-002 and babbage-002 are the new base models released a few days ago.

format provided by one of the authors here

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

...more

More shows like The Nonlinear Library: Alignment Forum

View all

AXRP - the AI X-risk Research Podcast

9 Listeners

Share AF - OpenAI base models are not sycophantic, at any size by nostalgebraist

Sign up to save your podcasts

AF - OpenAI base models are not sycophantic, at any size by nostalgebraist

AF - OpenAI base models are not sycophantic, at any size by nostalgebraist

More shows like The Nonlinear Library: Alignment Forum

AXRP - the AI X-risk Research Podcast