The Nonlinear Library

LW - Open problems in activation engineering by TurnTrout



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open problems in activation engineering, published by TurnTrout on July 24, 2023 on LessWrong.
Steering GPT-2-XL by adding an activation vector introduced activation engineering: techniques that steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime.
These results were recently complemented by Inference-Time Intervention: Eliciting Truthful Answers from a Language Model, which doubled TruthfulQA performance by adding a similarly computed activation vector to forward passes!
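Both results rest on the same core operation: compute a steering vector from a pair of contrasting activations, then add a scaled copy of it to the residual stream during a forward pass. A minimal numpy sketch of that operation, using random arrays as stand-ins for real model activations (`add_steering` and all shapes here are illustrative assumptions, not an API from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy model width

# Stand-ins for residual-stream activations at one layer, shape
# (seq_len, d_model). In practice these would come from forward
# passes on a pair of contrasting prompts (e.g. "Love" vs. "Hate").
act_prompt_a = rng.normal(size=(4, d_model))
act_prompt_b = rng.normal(size=(4, d_model))

# The steering vector is the per-position difference of the two
# activation sets.
steering_vec = act_prompt_a - act_prompt_b

def add_steering(resid, vec, coeff):
    """Add coeff * vec to the first len(vec) positions of the
    residual stream, leaving later positions untouched."""
    out = resid.copy()
    n = min(len(out), len(vec))
    out[:n] += coeff * vec[:n]
    return out

# Apply the addition to a (longer) stand-in residual stream.
resid = rng.normal(size=(10, d_model))
steered = add_steering(resid, steering_vec, coeff=5.0)
```

The coefficient is a free hyperparameter; the open problems below include understanding what happens when it is set too large.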
We think that activation engineering has a bunch of low-hanging fruit for steering and understanding models. A few open problems from the list:
Try decomposing the residual stream activations over a batch of inputs (e.g. via PCA). If you use the resulting principal directions as activation-addition directions, do they seem to capture something meaningful?
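As a concrete starting point, PCA over a batch of activations reduces to an SVD of the centered activation matrix. A minimal numpy sketch with random stand-in activations (the shapes, the number of components, and the coefficient are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_model = 256, 16

# Stand-in for residual-stream activations collected over a batch
# of inputs at one layer/position.
acts = rng.normal(size=(batch, d_model))

# PCA via SVD of the centered activation matrix.
mean = acts.mean(axis=0)
centered = acts - mean
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Rows of vt are unit-norm principal directions; these are the
# candidate activation-addition directions, scaled by a coefficient
# of the experimenter's choosing.
k = 3
directions = vt[:k]
candidate_additions = 10.0 * directions  # coefficient is a free choice
```

Each candidate direction could then be added to forward passes (as in the steering-vector experiments) to test whether it produces a coherent behavioral change.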
Take a circuit studied in the existing literature on GPT-2, or find another one using ACDC. By targeting the nodes in these circuits, can you learn anything more about them, and more generally about how activation additions interact with circuits?
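One simple way to target a node is to restrict the addition to particular sequence positions at a particular layer, rather than applying it everywhere. A toy numpy sketch, where `targeted_addition` is a hypothetical helper standing in for what would normally be a forward-pass hook on the chosen circuit node:

```python
import numpy as np

def targeted_addition(resid, vec, positions, coeff):
    """Add coeff * vec only at the given sequence positions,
    mimicking a hook that targets one node of a circuit while
    leaving the rest of the forward pass untouched."""
    out = resid.copy()
    out[list(positions)] += coeff * vec
    return out

rng = np.random.default_rng(1)
resid = rng.normal(size=(8, 4))  # (seq_len, d_model) stand-in
vec = rng.normal(size=4)         # stand-in steering direction

# Intervene only at positions 2 and 5, e.g. the tokens a known
# circuit attends to.
steered = targeted_addition(resid, vec, positions=[2, 5], coeff=3.0)
```

Comparing model behavior under targeted versus everywhere additions is one way to probe how a circuit mediates the steering effect.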
What's the mechanism by which adding a steering vector with too large a coefficient breaks the model? (Credit: Thomas Kwa; see also @Ulisse Mini's initial data/explanation.)
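One observation relevant to this question: once the coefficient is large enough, the added vector dominates the residual stream's norm, pushing activations far off-distribution. A numpy illustration with random stand-in vectors (not evidence about the actual mechanism, just the norm arithmetic):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32
resid = rng.normal(size=d_model)  # stand-in residual-stream vector
vec = rng.normal(size=d_model)    # stand-in steering vector

# Relative norm of the steered activation vs. the original, across a
# sweep of coefficients. Once coeff * ||vec|| >> ||resid||, the
# addition dominates the residual stream entirely.
coeffs = [0.0, 1.0, 10.0, 100.0]
ratios = [np.linalg.norm(resid + c * vec) / np.linalg.norm(resid)
          for c in coeffs]
```

Whether the model breaks because of this norm blowup, or because the direction itself stops being meaningful at large scales, is exactly the open question.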
If you want to work on activation engineering, come by the Slack server to coordinate research projects and propose new ideas.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library, by The Nonlinear Fund
