The Nonlinear Library

LW - The 'strong' feature hypothesis could be wrong by lsgos



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'strong' feature hypothesis could be wrong, published by lsgos on August 2, 2024 on LessWrong.
NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position.
"It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout" : Elhage et. al, Toy Models of Superposition
Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower-level components, has been focused on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder (SAE), in a series of papers inspired by Elhage et al.'s model of superposition.
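To make that methodology concrete, here is a minimal sketch of a sparse autoencoder of the kind this line of work trains on a model's activations. The architecture, dimensions, and L1 coefficient below are illustrative assumptions rather than the setup of any particular paper, and the "activations" are stand-in random data.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: many more hidden units than model dimensions.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Sparse, non-negative codes: each active unit is a candidate "feature".
        codes = torch.relu(self.encoder(x))
        # Reconstruction as a sum of decoder directions, one per active code.
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes codes toward sparsity.
    return ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()

# Usage sketch: real activations would come from a hooked layer of a language model.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)          # placeholder for a batch of activations
recon, codes = sae(acts)
loss = sae_loss(acts, recon, codes)
loss.backward()
```

After training, the columns of the decoder weight are read off as candidate feature directions (atoms), and the sparse codes indicate which atoms are active on a given input.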
Conventionally, there are understood to be two key theories underlying this agenda. The first is the 'linear representation hypothesis' (LRH): the hypothesis that neural networks represent many intermediates or variables of the computation (such as the 'features of the input' in the opening quote) as linear directions in their representation space, or atoms[1].
The second is the superposition hypothesis: the theory that the network is capable of representing more of these 'atoms' than it has dimensions in its representation space, via superposition.
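As a toy illustration of what superposition means (in the spirit of Elhage et al.'s toy models, but with made-up sizes), a sparse set of features can be stored as sums of nearly orthogonal directions in a space with fewer dimensions than features, and read back with dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 50, 20  # more features than dimensions

# Assign each feature a random unit direction; these are only *nearly* orthogonal.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
active = rng.choice(n_features, size=3, replace=False)
x = directions[active].sum(axis=0)  # representation = sum of the active atoms

# Decode by projecting onto each feature direction.
scores = directions @ x
recovered = np.argsort(scores)[-3:]

print(sorted(active.tolist()), sorted(recovered.tolist()))
```

Because only a few features are active at once, and random directions in even a moderately high-dimensional space have small dot products with one another, the active features can usually be recovered despite the interference. This is the sense in which a network could use more atoms than it has dimensions.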
While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail. It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as 'networks represent features of the input as directions in representation space'.
There are a few possible formulations of this:
1. (Weak LRH) some features used by neural networks are represented as atoms in representation space
2. (Strong LRH) all features used by neural networks are represented by atoms.
The weak LRH I would say is now well supported by considerable empirical evidence. The strong form is much more speculative: confirming the existence of many linear representations does not necessarily provide strong evidence for the strong hypothesis. Both the weak and the strong forms of the hypothesis can still have considerable variation, depending on what we understand by a feature.
I think that in addition to the acknowledged assumptions of the LRH and the superposition hypothesis, much work on SAEs in practice makes the assumption that each atom in the network will represent a "simple feature" or a "feature of the input". The features that the atoms are representations of are assumed to be 'monosemantic': each will stand for something that is human-interpretable in isolation. I will call this the monosemanticity assumption.
This is difficult to state precisely, but we might formulate it as the theory that every represented variable will have a single meaning in a good description of a model. This is not a straightforward assumption, because the notion of a single meaning is itself imprecise. While various more or less reasonable definitions of features are discussed in the pioneering work of Elhage et al., these assumptions have different implications.
For instance, if one thinks of 'features' as computational intermediates in a broad sense, then superposition and the LRH imply a certain picture of the format of a model's internal representation: that what the network is doing is manipulating atoms in superposition (if y...