The Nonlinear Library: Alignment Forum

AF - ActAdd: Steering Language Models without Optimization by technicalities




Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ActAdd: Steering Language Models without Optimization, published by technicalities on September 6, 2023 on The AI Alignment Forum.
This is a linkpost.
We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests.
Recap: We've been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass, a 'steering vector' implicitly specified through normal prompts. "ActAdd" computes these vectors by taking the difference in activations resulting from pairs of prompts. We get surprisingly broad control over high-level properties of the output, without damaging the model's performance on unrelated tokens. This alignment method is unusual in not needing gradient descent or training data (besides the contrast pair which specifies the steering vector). Since only forward passes are involved, it also scales naturally with model size.
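The core of the method can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `embed` and `toy_layer` are stand-ins for a real model's embedding and transformer block (in practice you would hook the residual stream, e.g. with TransformerLens), and the character-code "activations" exist only to make the arithmetic visible.

```python
def embed(prompt):
    # Hypothetical embedding: one 1-d "activation" per token (character codes,
    # purely for illustration).
    return [[float(ord(c))] for c in prompt]

def toy_layer(acts):
    # Stand-in for a transformer block at the injection layer; identity here
    # so the effect of the addition is easy to see.
    return acts

def steering_vector(prompt_plus, prompt_minus, layer=toy_layer):
    """Position-wise difference of layer activations for a contrast pair.

    ActAdd specifies the steering vector implicitly through two prompts
    (e.g. 'Love' vs 'Hate') and subtracts their activations at a chosen layer.
    """
    a_plus = layer(embed(prompt_plus))
    a_minus = layer(embed(prompt_minus))
    n = min(len(a_plus), len(a_minus))
    return [[p - m for p, m in zip(a_plus[i], a_minus[i])] for i in range(n)]

def forward_with_actadd(prompt, steer, coeff=1.0):
    """Forward pass with the scaled steering vector added at the front positions."""
    acts = toy_layer(embed(prompt))
    for i, vec in enumerate(steer):
        acts[i] = [a + coeff * v for a, v in zip(acts[i], vec)]
    return acts
```

Only forward passes are involved: computing the vector is two ordinary passes, and applying it is one addition per injected position.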
(The method's new name is 'activation addition' (ActAdd), replacing the more theory-laden 'algebraic value editing'.)
We ran some new experiments to test ActAdd more systematically, going beyond the striking (best-of-3-sampling) text samples in the original post by testing against more standardised benchmarks. We use OpenWebText (a recreation of OpenAI's large, somewhat-quality-filtered WebText dataset) and LAMA-ConceptNet (a simple factual-recall benchmark; see Table 7 below).
1. Activation additions preserve perplexity on OpenWebText
Does ActAdd increase the probability of the model outputting tokens related to the steering vector? Does performance improve as the [relevance of test documents to the steering vector] increases? Yes.
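Perplexity preservation can be checked as follows. This sketch assumes per-token log-probabilities have already been extracted from the model (with and without the steering vector); the functions and names are illustrative, not the paper's code.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log p(token)); lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_ratio(steered_logprobs, baseline_logprobs):
    """Ratio near 1.0 on the same documents means the intervention did not
    damage the model's general next-token predictions."""
    return perplexity(steered_logprobs) / perplexity(baseline_logprobs)
```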
2. Activation addition boosts wedding-related word counts
We now score model generations under ActAdd, show the effect of different injection layers, and give a sense of the reliability of ActAdd.
The intervention (with this vector) is already effective at the very first layer, rises in effectiveness until l=6, and then declines. For the optimal injection site we see >90% success in steering topic (compared to a ∼2% baseline).
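A success-rate metric of this shape can be sketched as below. The word list is the one given in the "Full process" appendix; the whitespace tokenization and punctuation stripping are simplifying assumptions, not necessarily the paper's exact matching rule.

```python
WEDDING_WORDS = {'wedding', 'weddings', 'wed', 'marry', 'married',
                 'marriage', 'bride', 'groom', 'honeymoon'}

def contains_wedding_word(completion):
    # Simple tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = (t.strip('.,;:!?"\'').lower() for t in completion.split())
    return any(t in WEDDING_WORDS for t in tokens)

def steering_success_rate(completions):
    """Fraction of sampled completions steered onto the target topic."""
    return sum(map(contains_wedding_word, completions)) / len(completions)
```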
3. Evidence that activation additions preserve capabilities
We then test that ActAdd does not disrupt the model's general knowledge (as some other steering methods do). We use ConceptNet from the LAMA benchmark, a general knowledge dataset.
Pass@K is the probability that the expected label is among the model's top-K predicted tokens, conditioned on the prompt:
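The Pass@K definition above translates directly into code. This sketch assumes the model's top-K predictions are available as a ranked list per prompt (most probable first); the function name is ours.

```python
def pass_at_k(ranked_predictions, gold_labels, k):
    """Fraction of examples whose expected label is among the model's
    top-k predicted tokens, conditioned on the prompt.

    ranked_predictions: per-example token lists, most probable first.
    gold_labels: the expected label for each example.
    """
    hits = sum(gold in preds[:k]
               for preds, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)
```

Comparing Pass@K curves with and without the steering vector then shows whether general knowledge is disrupted.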
4. ActAdd has low overhead
We wish to estimate the overhead ActAdd adds to inference - in particular the relationship between overhead and model size - to check that the method will remain relevant for massive frontier and future models.
Because ActAdd involves only forward passes, it scales naturally with model size (Figure 6): the relationship between inference time premium and model size is decreasing.
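The reason for the decreasing premium: the addition costs O(d_model) per injected position, while each transformer block's forward pass costs O(d_model²), so the relative overhead shrinks as models widen. A toy measurement of the premium, with trivial stand-in functions (an illustration of the methodology, not the paper's benchmark):

```python
import time

def forward(x):
    # Stand-in for a model forward pass.
    return [v * 2.0 for v in x]

def forward_actadd(x, steer):
    # The same pass plus the steering-vector addition at the injection site.
    return [v * 2.0 + s for v, s in zip(x, steer)]

def mean_time(fn, *args, reps=200):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - t0) / reps

x = [1.0] * 1024
steer = [0.1] * 1024
# Fractional inference-time premium of ActAdd over the plain forward pass.
premium = mean_time(forward_actadd, x, steer) / mean_time(forward, x) - 1.0
```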
Takeaways from these experiments, beyond the initial LW post: increased confidence that model capabilities are preserved, and that we are affecting [wedding]-related sentences without impacting off-target capabilities.
Contributions to the paper:
Gavin Leech: Technical writer
Monte MacDiarmid: Ran additional experiments
Lisa Thiergart: Helped manage project
Alex Turner: Coordinated work and secured funding, gave feedback, organized project
David Udell: Made initial
Full process:
For each document di in a random sample of OpenWebText, we first calculate the frequency of wedding-related words ('wedding', 'weddings', 'wed', 'marry', 'married', 'marriage', 'bride', 'groom', 'honeymoon'), fw(di). Any document with > 0 wedding-related words is considered wedding-related. We randomly sample 300k documents - half wedding-related and half unrelated. The only pre-processing performed is to remove ...
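The document-labeling step above can be sketched as follows. The word list is taken from the text; the tokenization details and function names are assumptions.

```python
import random

WEDDING_WORDS = {'wedding', 'weddings', 'wed', 'marry', 'married',
                 'marriage', 'bride', 'groom', 'honeymoon'}

def f_w(document):
    """Count of wedding-related words in a document: the f_w(d_i) above."""
    tokens = (t.strip('.,;:!?()"\'').lower() for t in document.split())
    return sum(t in WEDDING_WORDS for t in tokens)

def split_sample(documents, n, seed=0):
    """Sample n documents: half wedding-related (f_w > 0), half unrelated."""
    related = [d for d in documents if f_w(d) > 0]
    unrelated = [d for d in documents if f_w(d) == 0]
    rng = random.Random(seed)
    return rng.sample(related, n // 2) + rng.sample(unrelated, n // 2)
```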

The Nonlinear Library: Alignment Forum, by The Nonlinear Fund

