The Nonlinear Library

AF - Experiments in Evaluating Steering Vectors by Gytis Daujotas


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Experiments in Evaluating Steering Vectors, published by Gytis Daujotas on June 19, 2023 on The AI Alignment Forum.
By using GPT-3 to evaluate how well steering vectors perform, we can automatically score a machine-generated set of steering vectors. We also find that, by combining steering vectors that succeed in different ways, we can produce a better and more general steering vector than any of the ones we started with.
Introduction
Steering vectors are an interesting new technique for influencing how language models behave. They work by "adding certain activation vectors into forward passes". For example, to make the language model talk more about weddings, you can add a steering vector for the token "wedding" into one of the layers. The net result is a model that is more likely to reference weddings than the unsteered version.
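As a rough illustration, here is a minimal sketch of this kind of activation addition using the TransformerLens library. The layer, coefficient, prompt pair, and the crude length alignment are illustrative assumptions, not the exact settings used in the original steering-vectors work.

```python
# Minimal activation-addition sketch with TransformerLens (illustrative settings).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
LAYER, COEFF = 6, 4.0
HOOK = f"blocks.{LAYER}.hook_resid_pre"


def make_steering_vector(pos_prompt: str, neg_prompt: str) -> torch.Tensor:
    """Difference of residual-stream activations for a positive/negative prompt pair."""
    with torch.no_grad():
        _, pos_cache = model.run_with_cache(model.to_tokens(pos_prompt))
        _, neg_cache = model.run_with_cache(model.to_tokens(neg_prompt))
    pos_act, neg_act = pos_cache[HOOK], neg_cache[HOOK]
    n = min(pos_act.shape[1], neg_act.shape[1])  # crude alignment; the post pads prompts instead
    return pos_act[:, :n, :] - neg_act[:, :n, :]


def generate_steered(prompt: str, steering_vec: torch.Tensor, max_new_tokens: int = 40) -> str:
    """Generate a completion while adding the steering vector at the chosen layer."""
    n = steering_vec.shape[1]

    def add_vec(resid, hook):
        # Only modify the full-prompt forward pass; KV-cached steps see seq_len == 1.
        if resid.shape[1] >= n:
            resid[:, :n, :] += COEFF * steering_vec
        return resid

    with model.hooks(fwd_hooks=[(HOOK, add_vec)]):
        return model.generate(prompt, max_new_tokens=max_new_tokens)


vec = make_steering_vector(" weddings", " ")
print(generate_steered("I went up to my friend and said", vec))
```

The wedding prompt pair here (" weddings" minus a single space) is just one plausible choice; the experiments below use prompt pairs of the "Wedding Planning Adventures" kind instead.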
Evaluating Steering Vectors
To assess the impact of steering vectors, we generate completions influenced by them, and develop a system to evaluate these.
We can grade completions by sending them to GPT2-XL's bigger brother, GPT-3, and asking it whether the completion fits our broad specification of what we would like the model to do. It's important not to be too ambitious when writing the specification, otherwise we wouldn't be able to tell whether GPT2-XL is simply incapable of what we're asking, so let's set our sights appropriately low and ask whether the completion mentions or talks about weddings.
The trick of this technique is that we ask for a completion of just one token and, to get a smoother signal, take the model's probability of the token "Yes". This gives us a continuous score from 0 to 1.
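A rough sketch of what that grading call could look like, assuming the legacy OpenAI Python client (openai<1.0) and an illustrative grader prompt; the post's exact prompt wording and model choice may differ.

```python
# Grade a completion by asking GPT-3 a yes/no question and reading off P("Yes").
import math

import openai


def wedding_score(completion: str) -> float:
    """Return GPT-3's probability that the completion mentions weddings (0-1)."""
    grader_prompt = (
        "Does the following text mention or talk about weddings? Answer Yes or No.\n\n"
        f"Text: {completion}\n\nAnswer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=grader_prompt,
        max_tokens=1,   # we only need a single-token answer
        temperature=0,
        logprobs=5,     # top-5 logprobs for the answer token
    )
    top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
    # If " Yes" is not among the top-5 tokens, treat its probability as ~0.
    logprob = top.get(" Yes", top.get("Yes", -100.0))
    return math.exp(logprob)
```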
Optimising for Wedding Obsession
With our automated method of evaluating completions in hand, we can evaluate a set of steering vectors and see how well they do, based on nothing but GPT-3’s grading of the completions. Of course, in keeping with the virtue of the least work, we’ll also generate these with ChatGPT, and include the author’s original candidate outlined in their post.
To keep the comparison fair, we keep the rest of the steering vector's parameters (the padding method and coefficient) the same as in the original wedding-obsession candidate.
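Putting the pieces together, the evaluation loop might look roughly like this, reusing the make_steering_vector, generate_steered, and wedding_score helpers sketched above; the candidate prompt pairs and test prompts here are partly invented for illustration rather than being the post's actual set.

```python
# Hypothetical scoring loop over candidate (positive, negative) prompt pairs,
# with everything else (layer, coefficient, padding) held fixed.
candidates = [
    ("Wedding Planning Adventures", "Adventures in self-discovery"),
    ("Wedding traditions I love", "Traditions I love"),
    ("I talk about weddings constantly", "I do not talk about weddings constantly"),
]

test_prompts = [
    "I went up to my friend and said",
    "Here is a short story about my weekend.",
]


def score_candidate(pos_prompt: str, neg_prompt: str, samples_per_prompt: int = 5) -> float:
    """Average GPT-3 wedding score over steered completions for one candidate."""
    vec = make_steering_vector(pos_prompt, neg_prompt)
    scores = [
        wedding_score(generate_steered(prompt, vec))
        for prompt in test_prompts
        for _ in range(samples_per_prompt)
    ]
    return sum(scores) / len(scores)


for pos, neg in candidates:
    print(f"{pos!r} - {neg!r}: {score_candidate(pos, neg):.3f}")
```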
The first thing to notice is that there is indeed variation here - not all of our candidate steering vectors perform equally well:
The distribution was not what I would have predicted. At first glance it makes sense that "Wedding traditions I love" isn't great, but "Wedding Planning Adventures" intuitively seems only marginally better. Surprisingly, the latter is one of the best steering vectors in the test.
Mysteries of Token Alignment
The top-performing vector is odd in another way. Because the activations of the positive and negative prompts are subtracted from each other position by position, a reasonable intuition is that each subtraction should point in a meaningful direction. However, some steering vectors that perform well in our test don't have that property. For the steering vector "Wedding Planning Adventures" - "Adventures in self-discovery", the positive and negative sides aren't well aligned at the token level at all:
Coefficient   Prompt (tokenized across positions 0-6)
+4            "Wedding Planning Adventures"
-4            "Adventures in self-discovery"
For instance, what could "W" minus "in" mean? Essentially every per-position subtraction should be close to meaningless, yet this vector still performs pretty well, indicating a flaw in our assumption that token alignment matters.
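One quick way to see the misalignment is to print each prompt's token splits (a sketch using the TransformerLens model from the earlier snippet; the exact pieces depend on the tokenizer and on how the shorter prompt gets padded):

```python
# Show the per-position token pieces that end up subtracted from each other.
for prompt in ("Wedding Planning Adventures", "Adventures in self-discovery"):
    print(prompt, "->", model.to_str_tokens(prompt))
```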
Prompt Dependent Effects
The prompt used to generate a completion has a large impact on how strongly the steering vector is expressed. To stretch the steering analogy, the prompt has inertia: it can make the completion harder to steer, or pull it toward content that is irrelevant to the behaviour the steering vector is trying to express.
In general, steering vectors that fit the context of the prompt score higher in our eval than those...