July 07, 2024

AF - An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda

38 minutes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, published by Neel Nanda on July 7, 2024 on The AI Alignment Forum.

This post represents my personal hot takes, not the opinions of my team or employer. This is a massively updated version of a similar list I made two years ago.

There are a lot of mechanistic interpretability papers, and more come out all the time. This can be pretty intimidating if you're new to the field! To try to help out, here's a reading list of my favourite mech interp papers: papers which I think are important to be aware of, often worth skimming, and sometimes worth reading deeply (time permitting). I've annotated these with my key takeaways, what I like about each paper, which bits to deeply engage with vs skim, etc. I wrote a similar post 2 years ago, but a lot has changed since then, thus v2!

Note that this is not trying to be a comprehensive literature review - this is my answer to "if you have limited time and want to get up to speed on the field as fast as you can, what should you do". I'm deliberately not following academic norms like necessarily citing the first paper introducing something, or all papers doing some work, and am massively biased towards recent work that is more relevant to the cutting edge. I also shamelessly recommend a bunch of my own work here, sorry!

How to read this post: I've bolded the most important papers to read, which I recommend prioritising. All of the papers are annotated with my interpretation and key takeaways, and tbh I think reading that may be comparably good to skimming the paper. And there are far too many papers to read all of them deeply unless you want to make that a significant priority. I recommend reading all my summaries, noting the papers and areas that excite you, and then trying to dive deeply into those.

Foundational Work

A Mathematical Framework for Transformer Circuits (Nelson Elhage et al, Anthropic) - absolute classic, foundational ideas for how to think about transformers (see my blog post for what to skip). See my YouTube tutorial (I hear this is best watched after reading the paper, and adds additional clarity).

Deeply engage with:

All the ideas in the overview section, especially:

Understanding the residual stream and why it's fundamental.

The notion of interpreting paths between interpretable bits (eg input tokens and output logits), where the path is a composition of matrices, and how this is different from interpreting every intermediate activation.

And understanding attention heads: what the QK and OV matrices are, how attention heads are independent and additive, and how attention and OV are semi-independent.

Skip Trigrams & Skip Trigram bugs, esp understanding why these are a really easy thing to do with attention, and how the bugs are inherent to attention heads separating where to attend to (QK) and what to do once you attend somewhere (OV)

Induction heads, esp why this is K-Composition (and how that's different from Q & V composition), how the circuit works mechanistically, and why this is too hard to do in a 1L model

Skim or skip:

Eigenvalues or tensor products. They have the worst effort per unit insight of the paper and aren't very important.
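To make the QK/OV split from the "deeply engage with" list concrete, here is a minimal sketch of a single attention head with random weights. It is not code from the paper: the toy sizes, the helper functions, and the row-vector convention (activations multiply weights from the left) are all assumptions made for illustration, and the usual 1/sqrt(d_head) scaling is left out for brevity. It checks that the attention scores depend on W_Q and W_K only through the product W_QK, and that the head's output depends on W_V and W_O only through the product W_OV.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    pattern = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        pattern.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return pattern

rng = random.Random(0)
n_pos, d_model, d_head = 4, 6, 2  # toy sizes, chosen arbitrarily

x = rand_matrix(n_pos, d_model, rng)    # residual stream, one row per position
W_Q = rand_matrix(d_model, d_head, rng)
W_K = rand_matrix(d_model, d_head, rng)
W_V = rand_matrix(d_model, d_head, rng)
W_O = rand_matrix(d_head, d_model, rng)

# QK circuit: where to attend. Only the product W_Q W_K^T matters.
W_QK = matmul(W_Q, transpose(W_K))      # d_model x d_model, rank <= d_head
scores_two_step = matmul(matmul(x, W_Q), transpose(matmul(x, W_K)))
scores_folded = matmul(matmul(x, W_QK), transpose(x))

# OV circuit: what to move once attending. Only the product W_V W_O matters.
W_OV = matmul(W_V, W_O)                 # d_model x d_model, rank <= d_head
pattern = causal_softmax(scores_folded)
out_two_step = matmul(matmul(pattern, matmul(x, W_V)), W_O)
out_folded = matmul(pattern, matmul(x, W_OV))

print("max score difference: ", max_abs_diff(scores_two_step, scores_folded))
print("max output difference:", max_abs_diff(out_two_step, out_folded))
```

Both differences should be at floating-point noise level. Because the pattern enters the output only as a fixed mixing matrix, each head's output is linear in the residual stream once the pattern is held fixed, which is what lets paths through a head be written as a composition of matrices.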

Superposition

Superposition is a core principle/problem in model internals. For any given activation (eg the output of MLP13), we believe that there's a massive dictionary of concepts/features the model knows of. Each feature has a corresponding vector, and model activations are a sparse linear combination of these meaningful feature vectors.

Further, there are more features in the dictionary than activation dimensions, and they are thus compressed in and interfere with each other, essentially causing cascading errors. This phenomenon of compression is called superposition.
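As a rough illustration of the interference described above, the sketch below packs three features into a two-dimensional activation. The geometry (unit vectors spaced 120 degrees apart) and the dot-product readout are assumptions chosen for simplicity, not something specified in this post.

```python
import math

n_features, d_act = 3, 2  # more features than activation dimensions

# Hypothetical feature directions: unit vectors spaced evenly around a circle.
features = [
    (math.cos(2 * math.pi * k / n_features), math.sin(2 * math.pi * k / n_features))
    for k in range(n_features)
]

def activation(coeffs):
    """Sparse linear combination of the feature vectors."""
    return tuple(
        sum(c * f[d] for c, f in zip(coeffs, features)) for d in range(d_act)
    )

def readout(act):
    """Project the activation onto each feature direction."""
    return [sum(a * f_d for a, f_d in zip(act, f)) for f in features]

one_active = readout(activation([1.0, 0.0, 0.0]))
two_active = readout(activation([1.0, 1.0, 0.0]))

# One active feature: read back at 1, with -0.5 of interference on the others.
print([round(v, 3) for v in one_active])
# Two active features: each reads back at only 0.5, and the inactive one at -1.
print([round(v, 3) for v in two_active])
```

With a single active feature the readout is only mildly corrupted, but once two features are active at the same time the error is as large as the signal, which is why sparsity matters so much for superposition to work.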

Toy models of superpositio...


By The Nonlinear Fund
