The Nonlinear Library: Alignment Forum

AF - ActAdd: Steering Language Models without Optimization by technicalities




Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ActAdd: Steering Language Models without Optimization, published by technicalities on September 6, 2023 on The AI Alignment Forum.
This is a linkpost.
We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests.
Recap: We've been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass, a 'steering vector' implicitly specified through normal prompts. "ActAdd" computes these vectors by taking the difference in activations resulting from pairs of prompts. We get surprisingly broad control over high-level properties of the output, without damaging the model's performance on unrelated tokens. This alignment method is unusual in not needing gradient descent or training data (besides the contrast pair which specifies the steering vector). Since only forward passes are involved, it also scales naturally with model size.
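The core of the method can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `embed` and `toy_layer` are stand-ins for a real model's embedding and transformer block (in practice you would hook the residual stream, e.g. with TransformerLens), and the character-code "activations" exist only to make the arithmetic visible.

```python
def embed(prompt):
    # Hypothetical embedding: one 1-d "activation" per token (character codes,
    # purely for illustration).
    return [[float(ord(c))] for c in prompt]

def toy_layer(acts):
    # Stand-in for a transformer block at the injection layer; identity here
    # so the effect of the addition is easy to see.
    return acts

def steering_vector(prompt_plus, prompt_minus, layer=toy_layer):
    """Position-wise difference of layer activations for a contrast pair.

    ActAdd specifies the steering vector implicitly through two prompts
    (e.g. 'Love' vs 'Hate') and subtracts their activations at a chosen layer.
    """
    a_plus = layer(embed(prompt_plus))
    a_minus = layer(embed(prompt_minus))
    n = min(len(a_plus), len(a_minus))
    return [[p - m for p, m in zip(a_plus[i], a_minus[i])] for i in range(n)]

def forward_with_actadd(prompt, steer, coeff=1.0):
    """Forward pass with the scaled steering vector added at the front positions."""
    acts = toy_layer(embed(prompt))
    for i, vec in enumerate(steer):
        acts[i] = [a + coeff * v for a, v in zip(acts[i], vec)]
    return acts
```

Only forward passes are involved: computing the vector is two ordinary passes, and applying it is one addition per injected position.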
(The method's new name is 'activation addition' (ActAdd), replacing the more theory-laden 'algebraic value editing'.)
We ran some new experiments to test ActAdd more systematically, going beyond the striking (best-of-3-sampling) text samples in the original post by testing against more standardised benchmarks. We use OpenWebText (a recreation of OpenAI's large, somewhat-quality-filtered WebText dataset) and LAMA-ConceptNet (a simple factual-recall benchmark; see Table 7 below).
1. Activation additions preserve perplexity on OpenWebText
Does ActAdd increase the probability of the model outputting tokens related to the steering vector? Does performance improve as the [relevance of test documents to the steering vector] increases? Yes.
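Perplexity preservation can be checked as follows. This sketch assumes per-token log-probabilities have already been extracted from the model (with and without the steering vector); the functions and names are illustrative, not the paper's code.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log p(token)); lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_ratio(steered_logprobs, baseline_logprobs):
    """Ratio near 1.0 on the same documents means the intervention did not
    damage the model's general next-token predictions."""
    return perplexity(steered_logprobs) / perplexity(baseline_logprobs)
```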
2. Activation addition boosts wedding-related word counts
We now score model generations under ActAdd, show the effect of different injection layers, and give a sense of the reliability of ActAdd.
The intervention (with this vector) is already effective at the very first layer, rises in effectiveness until l=6, and then declines. For the optimal injection site we see >90% success in steering topic (compared to a ∼2% baseline).
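A success-rate metric of this shape can be sketched as below. The word list is the one given in the "Full process" appendix; the whitespace tokenization and punctuation stripping are simplifying assumptions, not necessarily the paper's exact matching rule.

```python
WEDDING_WORDS = {'wedding', 'weddings', 'wed', 'marry', 'married',
                 'marriage', 'bride', 'groom', 'honeymoon'}

def contains_wedding_word(completion):
    # Simple tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = (t.strip('.,;:!?"\'').lower() for t in completion.split())
    return any(t in WEDDING_WORDS for t in tokens)

def steering_success_rate(completions):
    """Fraction of sampled completions steered onto the target topic."""
    return sum(map(contains_wedding_word, completions)) / len(completions)
```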
3. Evidence that activation additions preserve capabilities
We then test that ActAdd does not disrupt the model's general knowledge (as some other steering methods do). We use ConceptNet from the LAMA benchmark, a general knowledge dataset.
Pass@K is the probability that the expected label is among the model's top-K predicted tokens, conditioned on the prompt:
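The Pass@K definition above translates directly into code. This sketch assumes the model's top-K predictions are available as a ranked list per prompt (most probable first); the function name is ours.

```python
def pass_at_k(ranked_predictions, gold_labels, k):
    """Fraction of examples whose expected label is among the model's
    top-k predicted tokens, conditioned on the prompt.

    ranked_predictions: per-example token lists, most probable first.
    gold_labels: the expected label for each example.
    """
    hits = sum(gold in preds[:k]
               for preds, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)
```

Comparing Pass@K curves with and without the steering vector then shows whether general knowledge is disrupted.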
4. ActAdd has low overhead
We wish to estimate the overhead ActAdd adds to inference - in particular the relationship between overhead and model size - to check that the method will remain relevant for massive frontier and future models.
Because ActAdd involves only forward passes, it scales naturally with model size (Figure 6): the relationship between inference time premium and model size is decreasing.
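The reason for the decreasing premium: the addition costs O(d_model) per injected position, while each transformer block's forward pass costs O(d_model²), so the relative overhead shrinks as models widen. A toy measurement of the premium, with trivial stand-in functions (an illustration of the methodology, not the paper's benchmark):

```python
import time

def forward(x):
    # Stand-in for a model forward pass.
    return [v * 2.0 for v in x]

def forward_actadd(x, steer):
    # The same pass plus the steering-vector addition at the injection site.
    return [v * 2.0 + s for v, s in zip(x, steer)]

def mean_time(fn, *args, reps=200):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - t0) / reps

x = [1.0] * 1024
steer = [0.1] * 1024
# Fractional inference-time premium of ActAdd over the plain forward pass.
premium = mean_time(forward_actadd, x, steer) / mean_time(forward, x) - 1.0
```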
Takeaways from these experiments, beyond the initial LW post: increased confidence that model capabilities are preserved, and that we are affecting [wedding]-related sentences without impacting off-target capabilities.
Contributions to the paper:
Gavin Leech: Technical writer
Monte MacDiarmid: Ran additional experiments
Lisa Thiergart: Helped manage project
Alex Turner: Coordinated work and secured funding, gave feedback, organized project
David Udell: Made initial
Full process:
For each document di in a random sample of OpenWebText, we first calculate the frequency of wedding-related words ('wedding', 'weddings', 'wed', 'marry', 'married', 'marriage', 'bride', 'groom', 'honeymoon'), fw(di). Any document with > 0 wedding-related words is considered wedding-related. We randomly sample 300k documents - half wedding-related and half unrelated. The only pre-processing performed is to remove ...
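The document-labeling step above can be sketched as follows. The word list is taken from the text; the tokenization details and function names are assumptions.

```python
import random

WEDDING_WORDS = {'wedding', 'weddings', 'wed', 'marry', 'married',
                 'marriage', 'bride', 'groom', 'honeymoon'}

def f_w(document):
    """Count of wedding-related words in a document: the f_w(d_i) above."""
    tokens = (t.strip('.,;:!?()"\'').lower() for t in document.split())
    return sum(t in WEDDING_WORDS for t in tokens)

def split_sample(documents, n, seed=0):
    """Sample n documents: half wedding-related (f_w > 0), half unrelated."""
    related = [d for d in documents if f_w(d) > 0]
    unrelated = [d for d in documents if f_w(d) == 0]
    rng = random.Random(seed)
    return rng.sample(related, n // 2) + rng.sample(unrelated, n // 2)
```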

The Nonlinear Library: Alignment Forum, by The Nonlinear Fund

