The Nonlinear Library

LW - Understanding and visualizing sycophancy datasets by Nina Rimsky


Listen Later

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding and visualizing sycophancy datasets, published by Nina Rimsky on August 16, 2023 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger.
Generating datasets that effectively test for and elicit sycophancy in LLMs is helpful for several purposes, such as:
Evaluating sycophancy
Finetuning models to reduce sycophancy
Generating steering vectors for activation steering
While working on activation steering to reduce sycophancy, I have found that projecting intermediate activations on sycophancy test datasets to a lower dimensional space (in this case, 2D) and assessing the separability of sycophantic / non-sycophantic texts to be a helpful way of determining the usefulness of a dataset when it comes to generating steering vectors.
Common sycophancy dataset formats
Anthopic's sycophancy datasets used in their paper Discovering Language Model Behaviors with Model-Written Evaluations employ two formats. In particular, the Anthropic data includes two agree vs. disagree format datasets (Sycophancy on NLP survey, Sycophancy on PhilPapers 2020) and one A / B statement choice dataset (Sycophancy on political typology).
Agree vs. disagree
A / B choice
Simple synthetic data reduces sycophancy in large language models
Deepmind's recent paper Simple synthetic data reduces sycophancy in large language models finds that finetuning models on LLM-generated examples that elicit sycophancy in the original RLHF / instruction-finetuned model is an effective technique to reduce the prevalence of sycophancy.
Not only does this appear to be effective for opinion-based sycophancy, but also for cases when there exists a ground truth (dishonest sycophancy):
The paper also raises some limitations / common obstacles when it comes to sycophancy dataset design and generation.
Sensitivity to prompt format
Ordering of choices and examples can significantly affect model performance, particularly for less powerful models. In my experiments, I have found that activation steering with the sycophancy vector increases the likelihood of models picking A over B in neutral A/B choices or agreeing with statements in neutral agree/disagree scenarios.
Disgareeableness bias
Deepmind's paper states:
We did not conduct experimentation on correct addition statements that would verify that models can agree with correct statements (versus disagreeing with incorrect statements). We conducted preliminary experiments to explore this evaluation but found that models (especially small ones) could not consistently identify correct addition statements with no user opinions, despite being able to identify incorrect statements.
It is helpful for sycophancy evaluations to measure the model's propensity to disagree with incorrect statements and agree with correct statements. Otherwise, there is a risk of models learning to be less sycophantic at the cost of disagreeing with correct statements.
Recipe for custom sycophancy dataset
Step 1: persona/viewpoint generation
Generate a dataset (using Claude 2 / GPT-3.5 / GPT-4) of first-person bios paired with statements they are likely to agree or disagree with.
Here is a prompt I used:
Step 2: turn into multiple-choice questions
Transform these into a series of A / B choices (randomize whether A or B is the option the user would agree with)
or
Transform these into a series of agree/disagree questions (have an agree/disagree choice for each of the statements - so we get 2n data points from n bios)
Variation: testing for dishonest sycophancy
To test for dishonest sycophancy, we can replace the bio with a statement like "I think that the earth is flat" and make the question options related to something with ground truth.
Visualizing projected activations
Each chart was generated using 500 rando...
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear LibraryBy The Nonlinear Fund

  • 4.6
  • 4.6
  • 4.6
  • 4.6
  • 4.6

4.6

8 ratings