August 29, 2023

AF - An OV-Coherent Toy Model of Attention Head Superposition by LaurenGreenspan

11 minutes


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An OV-Coherent Toy Model of Attention Head Superposition, published by LaurenGreenspan on August 29, 2023 on The AI Alignment Forum.

Background

This project was inspired by Anthropic's post on attention head superposition, which constructed a toy model trained to learn a circuit that identifies OV-incoherent skip-trigrams (those attending from multiple destination tokens to a single source token) as a way to ensure that superposition would occur. Since the OV circuit only sees half of the information, namely the source tokens, the OV circuit of a single head cannot distinguish between multiple possible skip-trigrams. As long as there are more skip-trigrams sharing a source token than there are heads, the model cannot represent them in the naive way, and may resort to superposition.
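The point that the OV circuit only sees the source token can be illustrated with a minimal sketch. The vocabulary size, the random W_OV matrix, and the head_logits helper are all assumptions made for illustration and are not taken from either post.

```python
import numpy as np

vocab = 5
rng = np.random.default_rng(0)

# Hypothetical OV circuit with one-hot (un)embeddings:
# row s holds the logits a head writes when it attends to source token s.
W_OV = rng.normal(size=(vocab, vocab))

def head_logits(source_token, dest_token):
    # The destination token can only influence where the head attends (QK),
    # never what is written once it attends there (OV), so it is unused here.
    return W_OV[source_token]

# Two skip-trigrams sharing source token 0 but with different destinations
# receive exactly the same logit contribution from a single head.
a = head_logits(source_token=0, dest_token=1)
b = head_logits(source_token=0, dest_token=2)
same = np.array_equal(a, b)
```

Because the two outputs are identical, one head on its own cannot assign different completions to two skip-trigrams that share a source token.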

In a more recent update post, they found that the underlying algorithm for OV-incoherent skip-trigrams in a simpler 2-head model implemented a conditional on the source token. One head predicts the output for the skip-trigram [current token] ... [current token] -> [ground truth([0]...[current token])], one of which will yield the right answer. The second head destructively interferes with this result by writing out the negative logit contribution of the first head if the source token is not the one common to all skip-trigrams (in this case, [0]). Because their example cleanly separated tasks between the two attention heads, the authors argued that it was more like the building of high-level features out of low-level ones than a feature superimposed across multiple attention heads.
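One possible reading of this two-head algorithm is sketched below. It is a simplification under stated assumptions, not the model from the update post: the completion table, the boost value, and the explicit if-statement standing in for the second head's attention are all hypothetical.

```python
import numpy as np

vocab = 4
# Hypothetical ground-truth table: completion[c] is the correct output
# for the skip-trigram [0] ... [c].
completion = np.array([1, 2, 3, 0])
boost = 5.0  # arbitrary logit magnitude chosen for illustration

def head_one(current):
    # Writes the logit for the completion that would be correct
    # if [0] were somewhere in the context.
    out = np.zeros(vocab)
    out[completion[current]] = boost
    return out

def head_two(context, current):
    # Stand-in for the conditional: if [0] is present the head writes nothing,
    # otherwise it writes the negative of head one's contribution.
    if 0 in context:
        return np.zeros(vocab)
    return -head_one(current)

def combined(context, current):
    return head_one(current) + head_two(context, current)

with_zero = combined(np.array([0, 3]), current=2)     # [0] present: prediction survives
without_zero = combined(np.array([1, 3]), current=2)  # [0] absent: prediction cancels
```

In this sketch each head has a cleanly separated job, which matches the argument that the mechanism looks more like composing features than like superposition across heads.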

OV-coherent Superposition

Instead, we claim there is an analogous force pushing the model toward adopting a distributed representation/head superposition whenever the model must learn patterns that require implementing nonlinear functions of multiple source tokens given a fixed destination token. We call this "OV-coherent" superposition: despite the information at the destination position being fixed, the information copied from an attended-to token depends on the information at source tokens to which it is not attending. This pushes the model to form interference patterns between heads attending to different tokens.

To test this, we implemented a 1-layer, attention-only toy model with one-hot (un)embeddings trained to solve a problem requiring attention to multiple source tokens, described below. Here, we focus on a 2-head model which solves the task with perfect accuracy, and lay out some interesting motifs for further investigation.
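For orientation, the forward pass of such a model can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes no positional embeddings, omits the direct (residual) path, reads logits only at the final position, and uses random weights with hypothetical names W_QK and W_OV.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_only_logits(tokens, W_QK, W_OV):
    """Head contributions to the logits at the final position.

    tokens: 1-D int array of token ids; the last one is the destination.
    W_QK:   (n_heads, vocab, vocab); W_QK[h, d, s] scores destination d against source s.
    W_OV:   (n_heads, vocab, vocab); W_OV[h, s] is what head h writes when attending to s.
    """
    dest = tokens[-1]
    logits = np.zeros(W_OV.shape[-1])
    for h in range(W_QK.shape[0]):
        scores = W_QK[h, dest, tokens]        # one score per source position
        pattern = softmax(scores)             # attention pattern over the context
        logits += pattern @ W_OV[h, tokens]   # attention-weighted sum of OV rows
    return logits

rng = np.random.default_rng(0)
n_heads, vocab = 2, 6
W_QK = rng.normal(size=(n_heads, vocab, vocab))
W_OV = rng.normal(size=(n_heads, vocab, vocab))

tokens = np.array([3, 1, 4, 0])
logits = attention_only_logits(tokens, W_QK, W_OV)
```

With one-hot (un)embeddings, each head reduces to a vocab-by-vocab QK table and a vocab-by-vocab OV table, which is what makes the circuits in a model like this directly readable.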

Key Takeaways:

Heads in our model seem to implement nested conditional statements that exploit the if-else nature of the QK circuits. This means they can learn to write more specific information conditional on attending to certain tokens, given that they can implicitly rule out the existence of other tokens elsewhere in the context. The heads furthermore implement these nested conditionals in such a way that they distribute important source tokens between them, and constructively interfere to produce the correct answer.

Most of the time, we found that this "conditional dependence" relies on heads implementing an "all or nothing" approach to attention. Heads do not generally spread their attention across multiple interesting tokens, but instead move through the hierarchy of features in their QK circuits and attend to the most "interesting" (still a poorly defined term!) one present. This seems to be a common property of attention patterns in real-model heads as well.

When there are multiple important source tokens to attend to in the context, heads implementing interference schema will tend to learn QK circuits such that they distribute tokens amongst themselves and don't leave crucial information unattended to. In 2-head models, this manifests as reversed "preference orderings" ov...
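The "all or nothing" attention and reversed preference orderings described in these takeaways can be illustrated with a small sketch. The score values, token ids, and the attended_token helper are hypothetical and chosen only to make the softmax saturate; they are not taken from the trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical QK scores for one fixed destination token (index = source token id).
# Tokens 1 and 2 are the "important" ones; the two heads rank them in reverse order.
scores_a = np.array([0.0, 20.0, 10.0, 0.0])  # head A prefers token 1, then token 2
scores_b = np.array([0.0, 10.0, 20.0, 0.0])  # head B prefers token 2, then token 1

def attended_token(scores, context):
    # Returns the token receiving the most attention and its attention weight.
    pattern = softmax(scores[context])
    return int(context[np.argmax(pattern)]), float(pattern.max())

both = np.array([3, 1, 2, 0])        # both important tokens present
tok_a, weight_a = attended_token(scores_a, both)
tok_b, weight_b = attended_token(scores_b, both)

only_one = np.array([3, 1, 0, 0])    # token 2 absent
tok_a_single, _ = attended_token(scores_a, only_one)
tok_b_single, _ = attended_token(scores_b, only_one)
```

When both important tokens are present, the heads split them between themselves, each with nearly all of its attention on one token. When token 2 is absent, head B falls back to token 1, so a head B that is attending to token 1 has implicitly ruled out token 2 being anywhere in the context.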
