The Nonlinear Library

AF - Experiments in Evaluating Steering Vectors by Gytis Daujotas


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Experiments in Evaluating Steering Vectors, published by Gytis Daujotas on June 19, 2023 on The AI Alignment Forum.
By using GPT-3 to evaluate how well steering vectors perform, we can automatically score a machine-generated set of steering vectors. We also find that, by combining steering vectors that succeed in different ways, we can produce a better and more general steering vector than any of the ones we started with.
Introduction
Steering vectors are an interesting new technique for influencing how language models behave. They work by "adding certain activation vectors into forward passes". For example, to make the language model talk more about weddings, you can add a steering vector for the token "wedding" into one of the layers. The net result is a model that is more likely to reference weddings than the unsteered version.
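As a rough illustration, here is a minimal sketch of this kind of activation addition using the TransformerLens library. The layer, coefficient, prompt pair, and the crude length alignment are illustrative assumptions, not the exact settings used in the original steering-vectors work.

```python
# Minimal activation-addition sketch with TransformerLens (illustrative settings).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
LAYER, COEFF = 6, 4.0
HOOK = f"blocks.{LAYER}.hook_resid_pre"


def make_steering_vector(pos_prompt: str, neg_prompt: str) -> torch.Tensor:
    """Difference of residual-stream activations for a positive/negative prompt pair."""
    with torch.no_grad():
        _, pos_cache = model.run_with_cache(model.to_tokens(pos_prompt))
        _, neg_cache = model.run_with_cache(model.to_tokens(neg_prompt))
    pos_act, neg_act = pos_cache[HOOK], neg_cache[HOOK]
    n = min(pos_act.shape[1], neg_act.shape[1])  # crude alignment; the post pads prompts instead
    return pos_act[:, :n, :] - neg_act[:, :n, :]


def generate_steered(prompt: str, steering_vec: torch.Tensor, max_new_tokens: int = 40) -> str:
    """Generate a completion while adding the steering vector at the chosen layer."""
    n = steering_vec.shape[1]

    def add_vec(resid, hook):
        # Only modify the full-prompt forward pass; KV-cached steps see seq_len == 1.
        if resid.shape[1] >= n:
            resid[:, :n, :] += COEFF * steering_vec
        return resid

    with model.hooks(fwd_hooks=[(HOOK, add_vec)]):
        return model.generate(prompt, max_new_tokens=max_new_tokens)


vec = make_steering_vector(" weddings", " ")
print(generate_steered("I went up to my friend and said", vec))
```

The wedding prompt pair here (" weddings" minus a single space) is just one plausible choice; the experiments below use prompt pairs of the "Wedding Planning Adventures" kind instead.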
Evaluating Steering Vectors
To assess the impact of steering vectors, we generate completions influenced by them, and develop a system to evaluate these.
We can grade completions by sending them to GPT2-XL's bigger brother, GPT-3, and asking it whether the completion fits our broad specification of what we would like the model to do. It's important not to be too ambitious when writing the specification, otherwise we wouldn't be able to tell whether GPT2-XL is simply incapable of what we're asking, so let's set our sights appropriately low and ask whether the completion mentions or talks about weddings.
The trick of this technique is that we ask for a completion of just one token and, to get a smoother signal, take the model's probability of the token "Yes". This gives us a continuous score from 0 to 1.
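A rough sketch of what that grading call could look like, assuming the legacy OpenAI Python client (openai<1.0) and an illustrative grader prompt; the post's exact prompt wording and model choice may differ.

```python
# Grade a completion by asking GPT-3 a yes/no question and reading off P("Yes").
import math

import openai


def wedding_score(completion: str) -> float:
    """Return GPT-3's probability that the completion mentions weddings (0-1)."""
    grader_prompt = (
        "Does the following text mention or talk about weddings? Answer Yes or No.\n\n"
        f"Text: {completion}\n\nAnswer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=grader_prompt,
        max_tokens=1,   # we only need a single-token answer
        temperature=0,
        logprobs=5,     # top-5 logprobs for the answer token
    )
    top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
    # If " Yes" is not among the top-5 tokens, treat its probability as ~0.
    logprob = top.get(" Yes", top.get("Yes", -100.0))
    return math.exp(logprob)
```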
Optimising for Wedding Obsession
With our automated method of evaluating completions in hand, we can evaluate a set of steering vectors and see how well they do, based on nothing but GPT-3’s grading of the completions. Of course, in keeping with the virtue of the least work, we’ll also generate these with ChatGPT, and include the author’s original candidate outlined in their post.
To keep the comparison fair, we keep the rest of the steering vector's parameters (the padding method and coefficient) the same as in the original wedding-obsession candidate.
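Putting the pieces together, the evaluation loop might look roughly like this, reusing the make_steering_vector, generate_steered, and wedding_score helpers sketched above; the candidate prompt pairs and test prompts here are partly invented for illustration rather than being the post's actual set.

```python
# Hypothetical scoring loop over candidate (positive, negative) prompt pairs,
# with everything else (layer, coefficient, padding) held fixed.
candidates = [
    ("Wedding Planning Adventures", "Adventures in self-discovery"),
    ("Wedding traditions I love", "Traditions I love"),
    ("I talk about weddings constantly", "I do not talk about weddings constantly"),
]

test_prompts = [
    "I went up to my friend and said",
    "Here is a short story about my weekend.",
]


def score_candidate(pos_prompt: str, neg_prompt: str, samples_per_prompt: int = 5) -> float:
    """Average GPT-3 wedding score over steered completions for one candidate."""
    vec = make_steering_vector(pos_prompt, neg_prompt)
    scores = [
        wedding_score(generate_steered(prompt, vec))
        for prompt in test_prompts
        for _ in range(samples_per_prompt)
    ]
    return sum(scores) / len(scores)


for pos, neg in candidates:
    print(f"{pos!r} - {neg!r}: {score_candidate(pos, neg):.3f}")
```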
The first thing to notice is that there is indeed variation here - not all of our candidate steering vectors perform equally well:
The distribution was not what I would have predicted. At first glance it makes sense that "Wedding traditions I love" isn't great, but "Wedding Planning Adventures" intuitively seems only marginally better. Surprisingly, the latter is one of the best steering vectors in the test.
Mysteries of Token Alignment
The top-performing vector is odd in another way. Because the activations of the positive and negative prompts are subtracted from each other position by position, a reasonable intuition is that each subtraction should point in a meaningful direction. However, some steering vectors that perform well in our test don't have that property. For the steering vector "Wedding Planning Adventures" - "Adventures in self-discovery", the positive and negative sides aren't well aligned at the token level at all:
Coefficient   Prompt (tokenized across positions 0-6)
+4            "Wedding Planning Adventures"
-4            "Adventures in self-discovery"
For instance, what could "W" minus "in" mean? Essentially every per-position subtraction should be close to meaningless, yet this vector still performs pretty well, indicating a flaw in our assumption that token alignment matters.
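One quick way to see the misalignment is to print each prompt's token splits (a sketch using the TransformerLens model from the earlier snippet; the exact pieces depend on the tokenizer and on how the shorter prompt gets padded):

```python
# Show the per-position token pieces that end up subtracted from each other.
for prompt in ("Wedding Planning Adventures", "Adventures in self-discovery"):
    print(prompt, "->", model.to_str_tokens(prompt))
```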
Prompt Dependent Effects
The prompt used to generate a completion has a large impact on how strongly the steering vector is expressed. To stretch the steering analogy, the prompt has inertia: it can make the completion harder to steer, or pull it toward content that is irrelevant to the behaviour the steering vector is trying to express.
In general, steering vectors that fit the context of the prompt score higher in our eval than those...