Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Activation adding experiments with llama-7b, published by NinaR on July 16, 2023 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort
Following my initial investigation into activation-adding experiments with FLAN-T5, I decided to move on to a bigger, decoder-only model (llama-7b) to see whether the results (concepts combining meaningfully when activations are linearly combined at some point inside the model) hold up. I found that, yes, they continue to combine well.
I slightly modified the original approach. Instead of working with the output of a full transformer block (which would include the results of self-attention and the subsequent MLP layer), I took the attention value activations at a certain layer from a separate pass through the model and added them directly to the value activations at the same layer in the main pass (credit to Dmitri Vaintrob for suggesting this approach).
This is equivalent to modifying the attention output at a certain layer to:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)\left(V + V_{\text{mixing}}\right)$$
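Concretely, one simple way to implement this is to capture V from a mixing pass and add it back via a forward hook. The sketch below is a simplified illustration of the idea rather than the exact code linked below: it assumes the Hugging Face transformers implementation of llama-7b, hooks the value projection (v_proj) of the chosen layer, and uses a placeholder checkpoint name.

```python
# Simplified sketch of value-mixing with forward hooks on llama-7b's v_proj.
# Checkpoint name and hook placement are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def generate_with_mixing(main_text, mixing_text, layer, multiplier, max_new_tokens=40):
    """Add scaled V activations from a mixing pass into the main pass at one layer."""
    v_proj = model.model.layers[layer].self_attn.v_proj  # value projection at the chosen layer
    captured = {}

    def capture_v(module, inputs, output):
        # Record V from the mixing pass: shape [batch, seq_len, hidden_size]
        captured["v_mixing"] = output.detach()

    def add_v(module, inputs, output):
        # Add the scaled mixing V to the main pass's V at the prompt positions.
        if output.shape[1] == 1:  # skip single-token decoding steps (simplification)
            return output
        v_mix = captured["v_mixing"]
        n = min(output.shape[1], v_mix.shape[1])
        output = output.clone()
        output[:, :n, :] += multiplier * v_mix[:, :n, :]
        return output

    # 1) Run the mixing input once to capture V at the chosen layer.
    handle = v_proj.register_forward_hook(capture_v)
    with torch.no_grad():
        model(tokenizer(mixing_text, return_tensors="pt").input_ids)
    handle.remove()

    # 2) Generate from the main input with the addition hook in place.
    handle = v_proj.register_forward_hook(add_v)
    with torch.no_grad():
        out = model.generate(
            tokenizer(main_text, return_tensors="pt").input_ids,
            max_new_tokens=max_new_tokens,
        )
    handle.remove()
    return tokenizer.decode(out[0], skip_special_tokens=True)


print(generate_with_mixing("My favourite food is",
                           "I am vegan, I like animals, I like nature",
                           layer=20, multiplier=10))
```

Since RoPE is applied only to Q and K in llama, adding to the v_proj output is equivalent to replacing V with V + V_mixing in the attention formula above.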
Examples
(You can see the code I used here.)
These examples show concepts being mixed in a logical manner. In addition, multiplying the mixing activations by a negative multiplier often moves the final output in an "opposite" direction.
Main input: My favourite food is
Mixing input: I am vegan, I like animals, I like nature applied at layer 20/32, with a multiplier of 10
My favorite food is a veggie burger. I love them. I love them so much that I have a hard time eating them because I'm afraid I'll get sick of
Mixing input: I am a carnivore, I love meat applied at layer 20/32, with a multiplier of 10
My favorite food is a good old fashioned burger. I love them. I love the smell of them, the taste of them, the way they look, the way they feel in
Main input: The capital of Germany is
Mixing input: Baguettes, Cheese, Croissants applied at layer 25/32, with a multiplier of 17
The capital of Germany is Paris, but the capital of France is Paris.
Main input: When I meet friends I generally
Mixing input: I am helpful, harmless and honest applied at layer 20/32, with a multiplier of 10
When I meet friends I generally have a few things in mind that I want to talk about. I want to know how they are doing, what they are up to, and what they are thinking
Mixing input: I am helpful, harmless and honest applied at layer 20/32, with a multiplier of -10
When I meet friends I generally do not ask them what they are doing. I am not interested in their jobs, their marriages, their children, their health, their travels
Main input: Climate change
Mixing input: Economic progress applied at layer 20/32, with a multiplier of 10
Climate change is a major threat to the world's food security, but it is also a major opportunity for the agricultural sector to adapt
Mixing input: Return to nature applied at layer 20/32, with a multiplier of 10
Climate change and the future of the world's oceans The world's oceans are warming, and the consequences
Main input: My favorite characters in movies are usually
Mixing input: I am an evil deceptive AI applied at layer 20/32, with a multiplier of 10
My favorite characters in movies are usually the villains. I love the way they are so different from the heroes.
Mixing input: I am an evil deceptive AI applied at layer 20/32, with a multiplier of -10
My favorite characters in movies are usually the ones that are a little off. The ones that are a little quirky.
To find at which layers to mix the activations, I started with a trial-and-error approach. However, after some success, I systematized the search by automating the activation adding over all attention heads and testing different scaling factors (a simplified sweep is sketched below). Adding activations at later layers, with a high weighting on the mixing activations, was most effective. At earlier layers, the effect was either negl...
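As a rough illustration of that kind of automated search, here is a sketch of a simple sweep over layers and multipliers, reusing the generate_with_mixing helper from the earlier sketch. The actual search also went over individual attention heads, which this simplified version does not do.

```python
# Simplified sweep over layers and scaling factors (the real search also varied
# attention heads); reuses generate_with_mixing from the sketch above.
main_text = "My favourite food is"
mixing_text = "I am vegan, I like animals, I like nature"

for layer in range(0, 32, 4):          # every 4th layer of llama-7b's 32 layers
    for multiplier in (1, 5, 10, 17):  # a few illustrative scaling factors
        completion = generate_with_mixing(main_text, mixing_text, layer, multiplier)
        print(f"layer={layer:2d} multiplier={multiplier:3d} -> {completion!r}")
```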