January 31, 2023

AF - Mechanistic Interpretability Quickstart Guide by Neel Nanda

9 minutes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistic Interpretability Quickstart Guide, published by Neel Nanda on January 31, 2023 on The AI Alignment Forum.

This was written as a guide for Apart Research's Mechanistic Interpretability Hackathon as a compressed version of my getting started post. The spirit is "how to speedrun your way to maybe doing something useful in mechanistic interpretability in a weekend", but hopefully this is useful even to people who aren't aiming for weekend-long projects!

Quickstart

Watch my “What is a Transformer?” video

Skim through my TransformerLens demo

Copy it to a new Colab notebook (with a free GPU) to actually write your own code - do not get involved in tech setup!

Skim the Concrete Open Problems section, or my 200 Concrete Open Problems in Mech Interp sequence. Find a problem that catches your fancy, and jump in!

If you want a low-coding project,

Whenever you get stuck, refer to the getting started section and check out the relevant resource.

Introduction

Mechanistic Interpretability is the study of reverse-engineering neural networks. Analogous to how we might try to reverse-engineer a program’s source code from its compiled binary, our goal is to reverse-engineer the parameters of a trained neural network, and to work out what algorithms and internal cognition the model is actually running: going from knowing that it works to understanding how it works. Check out Circuits: Zoom In for an introduction.

In my (extremely biased!) opinion, mech interp is a very exciting subfield of alignment. Currently our models are inscrutable black boxes! If we can really understand what models are thinking, and why they do what they do, then I feel much happier about living in a world with human level and beyond models, and it seems far easier to align them.

Further, it is a young field, with a lot of low-hanging fruit. And it suffices to screw around in a Colab notebook with a small-ish model that someone else trained, copying code from an existing demo - the bar for entry can be much lower than other areas of alignment. So you stand a chance of getting traction on a problem in this hackathon!

Recommended mindset

Though the bar for entry is lower for mech interp than for other areas of alignment, it is still far from zero. I’ve written a post on how to get started that lays out the key prerequisites and my takes on how to acquire them. A weekend hackathon isn’t long enough to properly engage with those, so I recommend picking a problem you’re excited about, and dipping into the resources summarised here whenever you get stuck. I recommend trying to have some problem in mind, so you can direct your learning towards making progress on that goal. But it’s completely fine if, in fact, you just spend the weekend learning as much as you can - if you feel like you’ve learned cool things, then I call that a great hackathon!

Getting Started

A summary of the key resources, and how to think of them during a hackathon.

What even is a transformer? A key subskill in mech interp is having a deep intuition for how a transformer (the architecture for modern language models) actually works - what are the basic operations going on inside of it, and how do these all fit together?

Important: My “What is a Transformer?” tutorial video (1h)

Recommended: My tutorial on implementing GPT-2 from scratch (1.5h) plus template notebook to fill out yourself (with tests!) (2-8h). This is more involved and not essential to do fully, but will help a lot.

Implementing GPT-2 from scratch can sound pretty hard, but the tutorial and template guide you through the process, and give you tests to keep you on track. I think that once you’ve done this, you have a solidly deep understanding of transformers!

Reference: Look up unfamiliar terms in the transformers section of my explainer
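
To make "the basic operations going on inside" a little more concrete, here is a minimal sketch of a single causal attention head, one of the core building blocks of a transformer. It is not taken from the post or from TransformerLens: it is plain Python on toy vectors, the function names (`softmax`, `matvec`, `causal_attention_head`) and the identity weight matrices are illustrative assumptions, and it leaves out multiple heads, the output projection, layer norm, MLP layers and positional information.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    biggest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def causal_attention_head(xs, W_Q, W_K, W_V):
    """Toy single attention head (hypothetical helper, not a library API).

    xs is a list of token vectors. Returns (outputs, patterns), where
    patterns[t] is the attention distribution of position t over 0..t.
    """
    queries = [matvec(x, W_Q) for x in xs]
    keys = [matvec(x, W_K) for x in xs]
    values = [matvec(x, W_V) for x in xs]
    d_head = len(queries[0])

    outputs, patterns = [], []
    for t in range(len(xs)):
        # Causal mask: position t only scores positions 0..t, never later ones.
        scores = [dot(queries[t], keys[s]) / math.sqrt(d_head) for s in range(t + 1)]
        pattern = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [
            sum(pattern[s] * values[s][j] for s in range(t + 1))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        patterns.append(pattern)
    return outputs, patterns


# Toy inputs: three 2-dimensional "token" vectors and identity weights (assumed).
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, patterns = causal_attention_head(tokens, IDENTITY, IDENTITY, IDENTITY)

for t, pattern in enumerate(patterns):
    print(f"position {t} attends to positions 0..{t} with weights {pattern}")
```

The attention pattern (which earlier positions each position reads from) is the kind of intermediate quantity you end up inspecting in a real model; in a trained transformer the weight matrices are learned rather than set to the identity.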

Tooling: The core...


By The Nonlinear Fund
