The Nonlinear Library

LW - Comments on Anthropic's Scaling Monosemanticity by Robert AIZI



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comments on Anthropic's Scaling Monosemanticity, published by Robert AIZI on June 3, 2024 on LessWrong.
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
TL;DR
In roughly descending order of importance:
1. It's great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features. It's great that those features allow interventions like the recently departed Golden Gate Claude. I especially like the code bug feature.
2. I worry that naming features after high-activating examples (e.g. "the Golden Gate Bridge feature") gives a false sense of security. Most of the time that feature activates, it is irrelevant to the Golden Gate Bridge. That feature is only well-described as "related to the Golden Gate Bridge" if you condition on a very high activation, and that's <10% of its activations (from an eyeballing of the graph).
3. This work does not address my major concern about dictionary learning: it is not clear dictionary learning can find specific features of interest, "called-shot" features, or "all" features (even in a subdomain like "safety-relevant features"). I think the report provides ample evidence that current SAE techniques fail at this.
4. The SAE architecture seems to be almost identical to how Anthropic and my team were doing it 8 months ago, except that the ratio of features to input dimension is higher. I can't say exactly how much because I don't know the dimensions of Claude, but I'm confident the ratio is at least 30x (for their smallest SAE), up from 8x 8 months ago.
5. The correlations between features and neurons seem remarkably high to me, and I'm confused by Anthropic's claim that "there is no strongly correlated neuron".
6. Still no breakthrough on "a gold-standard method of assessing the quality of a dictionary learning run", which continues to be a limitation on developing the technique. The metric they primarily used was the loss function (a combination of reconstruction accuracy and L1 sparsity; see the sketch after this list).
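For concreteness, here's the kind of architecture and loss I'm describing in points 4-6: a single hidden layer whose width is some expansion factor times the model dimension, trained to minimize reconstruction error plus an L1 penalty on the feature activations. This is only a sketch; the expansion factor of 30 comes from my estimate above, while the class name, the l1_coeff value, and details like bias handling are illustrative guesses rather than Anthropic's actual implementation.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal one-hidden-layer SAE with an L1 sparsity penalty (sketch)."""

    def __init__(self, d_model: int, expansion: int = 30, l1_coeff: float = 5e-3):
        super().__init__()
        d_features = expansion * d_model       # e.g. >=30x the residual-stream width
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse, nonnegative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        reconstruction, features = self(x)
        reconstruction_loss = (reconstruction - x).pow(2).mean()            # reconstruction accuracy
        sparsity_loss = self.l1_coeff * features.abs().sum(dim=-1).mean()   # L1 sparsity
        return reconstruction_loss + sparsity_loss
```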
I'll now expand some of these into sections. Finally, I'll suggest some follow-up research/tests that I'd love to see Anthropic (or a reader like you) try.
A Feature Isn't Its Highest Activating Examples
Let's look at the Golden Gate Bridge feature, because it's fun and because it's a good example of what I'm talking about. Here's my annotated version of Anthropic's diagram:
I think Anthropic successfully demonstrated (in the paper and with Golden Gate Claude) that this feature, at very high activation levels, corresponds to the Golden Gate Bridge. But on a median instance of text where this feature is active, it is "irrelevant" to the Golden Gate Bridge, according to their own autointerpretability metric! I view this as analogous to naming water "the drowning liquid", or Boeing "the door-exploding company".
Yes, in extremis, water and Boeing are associated with drowning and door blowouts, but any interpretation that ends there would be limited.
Anthropic's work writes around this uninterpretability in a few ways: naming the feature based on the top examples, highlighting those top examples, pinning the intervention model to 10x the activation (vs. 0.1x its top activation), and showing subsamples from evenly spaced intervals (vs. deciles).
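As a rough picture of what "pinning" a feature means mechanically, here is one plausible way to implement it, reusing the SparseAutoencoder sketch from above: encode the layer's activations, overwrite the chosen feature with a fixed value (e.g. 10x its observed maximum), decode, and add back the SAE's reconstruction error so everything else is left untouched. The function name and the error-preserving detail are my assumptions, not a description of Anthropic's code.

```python
import torch


def clamp_feature(residual: torch.Tensor,
                  sae: "SparseAutoencoder",
                  feature_idx: int,
                  clamp_value: float) -> torch.Tensor:
    """Return steered activations with one SAE feature pinned to a fixed value.

    residual: activations at the SAE's layer, shape (..., d_model)
    clamp_value: e.g. 10x the feature's observed maximum activation
    """
    features = torch.relu(sae.encoder(residual))
    error = residual - sae.decoder(features)    # keep whatever the SAE fails to reconstruct
    features[..., feature_idx] = clamp_value    # pin the chosen feature
    return sae.decoder(features) + error        # feed this to the model's later layers
```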
I think it would be illuminating if they added to their feature browser page some additional information about the fraction of instances in each subsample, e.g., "Subsample Interval 2 (0.4% of activations)".
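Those fractions are cheap to compute once you have a feature's activations over a dataset. A minimal sketch, assuming the subsample intervals are evenly spaced over the nonzero activation range (the function name and interval count are illustrative):

```python
import numpy as np


def interval_fractions(activations: np.ndarray, n_intervals: int = 5) -> list[float]:
    """Fraction of nonzero activations landing in each evenly spaced interval of [0, max]."""
    nonzero = activations[activations > 0]
    edges = np.linspace(0.0, nonzero.max(), n_intervals + 1)
    counts, _ = np.histogram(nonzero, bins=edges)
    return (counts / counts.sum()).tolist()


# A label like "Subsample Interval 2 (0.4% of activations)" would then just report
# interval_fractions(acts)[1] formatted as a percentage.
```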
Whether features are or aren't their top activating examples is important because it constrains their usefulness:
Could work with our current feature discovery approach: find the "aligned with human flourishing" feature, and pin that to 10x its max activation. ...