The Nonlinear Library

LW - Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition by cmathw


Listen Later

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition, published by cmathw on April 9, 2024 on LessWrong.
This work represents progress on removing attention head superposition. We are excited by this approach but acknowledge there are currently various limitations. In the short term, we will be working on adjacent problems are excited to collaborate with anyone thinking about similar things!
Produced as part of the
ML Alignment & Theory Scholars Program - Summer 2023 Cohort
Summary: In transformer language models, attention head superposition makes it difficult to study the function of individual attention heads in isolation. We study a particular kind of attention head superposition that involves constructive and destructive interference between the outputs of different attention heads. We propose a novel architecture - a 'gated attention block' - which resolves this kind of attention head superposition in toy models.
In future, we hope this architecture may be useful for studying more natural forms of attention head superposition in large language models.
Our code can be found here.
Background
Mechanistic interpretability aims to reverse-engineer what neural networks have learned by decomposing a network's functions into human-interpretable algorithms. This involves isolating the individual components within the network that implement particular behaviours. This has proven difficult, however, because networks make use of polysemanticity and superposition to represent information.
Polysemanticity in a transformer's multi-layer perceptron (MLPs) layers is when neurons appear to represent many unrelated concepts (Gurnee et al., 2023). We also see this phenomena within the transformer's attention mechanism, when a given attention head performs qualitatively different functions based on its destination token and context (Janiak et al., 2023).
Superposition occurs when a layer in a network (an 'activation space') represents more features than it has dimensions. This means that features are assigned to an overcomplete set of directions as opposed to being aligned with e.g. the neuron basis.
The presence of polysemanticity means that the function of a single neuron or attention head cannot be defined by the features or behaviours it expresses on a subset of its training distribution because it may serve different purposes on different subsets of the training distribution. Relatedly, superposition makes it misleading to study the function of individual neurons or attention heads in isolation from other neurons or heads.
Both of these phenomena promote caution around assigning specific behaviours to individual network components (neurons or attention heads), due to there both being a diversity in behaviours across a training distribution and in their interaction with other components in the network.
Although polysemanticity and superposition make the isolated components of a network less immediately interpretable, understanding of the correct functional units of analysis has improved. Progress has been made on both understanding features as directions within an activation space (Elhage et al., 2023) and resolving feature superposition by applying sparse autoencoders to identify highly-interpretable features (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023).
Attention head superposition for OV-Incoherent Skip Trigrams
Superposition in the context of attention heads is less understood. It is however conceivable that an attention block could make use of a similar compression scheme to implement more behaviours than the number of attention heads in the block.
Prior work introduced a task to study attention head superposition in the form of OV-Incoherent Skip Trigrams (Jermyn et al., 2023; Conerly et al., 2023). These are s...
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear LibraryBy The Nonlinear Fund

  • 4.6
  • 4.6
  • 4.6
  • 4.6
  • 4.6

4.6

8 ratings