The Nonlinear Library

AF - 200 COP in MI: Image Model Interpretability by Neel Nanda



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 200 COP in MI: Image Model Interpretability, published by Neel Nanda on January 8, 2023 on The AI Alignment Forum.
This is the eighth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
I’ll make another post every 1-2 days, giving a new category of open problems. If you want to read ahead, check out the draft sequence here!
Motivating papers: Thread: Circuits, Multimodal Neurons in Artificial Neural Networks
Disclaimer: My area of expertise is language model interpretability, not image models - it would not surprise me at all if this section contains errors, or if there are a lot of great open problems that I’ve missed!
Background
A lot of the early work in mechanistic interpretability was focused on reverse engineering image classification models, especially Inceptionv1 (GoogLeNet). This work was largely (but not entirely!) led by Chris Olah and the OpenAI interpretability team. They got a lot of fascinating results, most notably (in my opinion):
Finding a technique called Feature Visualization to visualize what neurons are “looking at”, essentially creating a picture that represents what most activates a given neuron
Intuitively, this technique exploits the fact that each neuron is basically a function that maps an image to a scalar (the neuron activation). Images live in a continuous space (we can vary a pixel by an infinitesimal amount to slightly change the image), so we can do gradient ascent on the image itself to find a maximally activating image
Curve Circuits - they reverse engineered a roughly 50,000 parameter circuit used to form curve detecting neurons, and understood it well enough that they could hand-code the weights of the neurons, insert these hand-coded neurons into the model, and (mostly) recover the original performance.
Multimodal Neurons - they looked at CLIP (a model that takes in an image and a caption and outputs a score for how well they match) and found a bunch of fascinating abstract neurons that seemed to activate on concepts and on things related to that concept - e.g. a Donald Trump neuron that activates on his picture and on MAGA hats, or neurons corresponding to Halloween, or anime, or Catholicism
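The gradient-ascent idea behind feature visualization can be sketched in a few lines. This is a minimal illustration, not the actual Circuits-team implementation: a tiny random-weight conv net stands in for InceptionV1, and the "neuron" is one output channel's mean activation. (Real feature visualization adds regularization such as jitter, rotation, and frequency penalties to make the resulting images human-interpretable.)

```python
import torch
import torch.nn as nn

# Toy stand-in for an image model: maps a 3-channel image to 16 channels.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
)
model.eval()


def visualize_channel(model, channel, steps=50, lr=0.05, size=32):
    """Optimize a random image to maximize one channel's mean activation."""
    img = torch.randn(1, 3, size, size, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    with torch.no_grad():
        start = model(img)[0, channel].mean().item()
    for _ in range(steps):
        opt.zero_grad()
        act = model(img)[0, channel].mean()  # scalar "neuron" activation
        (-act).backward()                    # ascend by descending the negative
        opt.step()
    with torch.no_grad():
        final = model(img)[0, channel].mean().item()
    return img.detach(), start, final


img, start, final = visualize_channel(model, channel=3)
print(f"activation before: {start:.3f}, after: {final:.3f}")
```

Because the image lives in a continuous space, the optimizer can freely nudge every pixel; after a few dozen steps the channel's activation is far above its value on random noise, and in a trained model the resulting image reveals what the channel responds to.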
Motivation
I think the image interpretability results are awesome, and one of the main things that convinced me that reverse engineering neural networks was even possible! But also, very few people worked on these questions! There was enough work done to give a good base to build off of and to expose a lot of dangling threads, but a lot of open questions remain.
My personal goal with mech interp is to get good enough at understanding systems that we can eventually understand what’s going on in a human-level frontier model, and use this to help align it. From this perspective, is it worth continuing image circuits work? This is not obvious to me! I think language models (and to a lesser degree transformers) are far more likely to be a core part of how we get transformative AI (though I do expect transformative AI to have significant multimodal components), and most of the mech interp field is now focused on LLMs as a result.
But I also think that any progress on reverse engineering neural networks is good, and at least some insights transfer. Though there are obviously a lot of differences - Inception has a continuous rather than discrete input space, doesn’t have attention heads or a residual stream, and is doing classification rather than generation. I’m personally most excited about image circuits work driving towards fundamental questions about reverse engineering networks:
How architecture or model specific is reverse eng...
The Nonlinear Library, by The Nonlinear Fund