The Nonlinear Library: Alignment Forum

AF - Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders by Johnny Lin




Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders, published by Johnny Lin on March 25, 2024 on The AI Alignment Forum.
This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers.
TL;DR
Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we've pivoted to accelerating Sparse Autoencoder (SAE) research by hosting models, feature dashboards, data visualizations, tooling, and more.
Important Links
Explore: The SAE-research-focused Neuronpedia. Current SAEs for GPT2-Small:
RES-JB: Residuals - Joseph Bloom (294k features)
ATT-KK: Attention Out - Connor Kissane + Robert Kryzanowski (344k features)
Upload: Get your SAEs hosted by Neuronpedia: fill out this <10-minute application
Participate: Join #neuronpedia on the Open Source Mech Interp Slack
Neuronpedia has received 1 year of funding. Johnny Lin is working full-time on engineering, design, and product, while Joseph Bloom is supporting with high-level direction and product management. We'd love to talk to you about how Neuronpedia can speed up your SAE research. Fill out this short form to get in touch.
Introduction
Why Neuronpedia?
While SAEs are exciting, they introduce a number of engineering challenges which can slow or even prevent research from being done. These challenges arise because SAEs decompose neural networks into numerous distinct features which we are trying to understand.
Even when working with relatively small language models like GPT2-Small, there are challenges in producing and storing feature dashboards, automatically interpreting features, and organizing the insights we collect about features.
As the community produces more SAEs, trains SAEs on larger models, and develops better techniques for understanding and analyzing these artifacts, there will be an ever-increasing number of engineering challenges in the way of research into model internals.
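To make the decomposition concrete, here is a minimal sketch of a sparse autoencoder in the style described above. The dimensions, class name, and expansion factor are illustrative assumptions, not any particular lab's implementation.

```python
# Minimal SAE sketch: decompose model activations into sparse features
# and reconstruct them. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 24576):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> features
        self.decoder = nn.Linear(d_features, d_model)   # features -> reconstruction

    def forward(self, activations: torch.Tensor):
        # ReLU zeroes out non-firing features, which is where the
        # sparsity in "sparse autoencoder" comes from (together with an
        # L1 penalty on the features during training, omitted here).
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(1, 768)  # one residual-stream activation vector
features, reconstruction = sae(acts)
print((features > 0).sum().item(), "active features out of", features.shape[-1])
```

Each of those tens of thousands of features is a candidate unit of analysis, which is what drives the dashboard, storage, and interpretation challenges above.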
What It's Already Useful For
Generating and hosting more than 600k SAE feature dashboards. This includes GPT2-small residual stream SAEs for Joseph Bloom (RES-JB) and attention out SAEs for Connor Kissane and Robert Kryzanowski (ATT-KK).
Running Python servers to enable prompt-based feature search and feature-specific prompt testing. This makes it easier for researchers and the community to qualitatively test features, write up results, and share SAEs (a minimal sketch of feature testing follows this list).
Generating automatic explanations of features with GPT-3.5-Turbo, which can be navigated via UMAPs. The automatic explanations are not currently amazing, but hopefully we can iterate and improve on them.
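As a rough illustration of the prompt-testing workflow mentioned above, the sketch below runs a prompt through GPT2-small with TransformerLens and lists the strongest-firing SAE features at the last token. The encoder here is a random, untrained stand-in for a real trained SAE, and the layer choice is arbitrary.

```python
# Hedged sketch of prompt-based feature testing. The encoder is a
# random stand-in for a trained SAE; layer 8 is an arbitrary choice.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The cat sat on the mat")
resid = cache["blocks.8.hook_resid_pre"]  # shape: [batch, seq, d_model]

encoder = torch.nn.Linear(model.cfg.d_model, 24576)  # stand-in SAE encoder
features = torch.relu(encoder(resid))

# Top five features at the final token position.
top_vals, top_ids = features[0, -1].topk(5)
for fid, val in zip(top_ids.tolist(), top_vals.tolist()):
    print(f"feature {fid}: activation {val:.2f}")
```

With a real trained SAE, the feature IDs printed here are exactly the ones whose dashboards you would then look up.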
Making It More Useful
Researchers / Labs publish SAEs that are uploaded to Neuronpedia: There are many different labs training SAEs, so it's a safe bet that there will be many SAEs in the public domain in the near future. We make it easy for researchers to host their SAEs and enable functionality like feature testing, UMAPs, and more.
Generate feature dashboards, automatic interpretability, and benchmarks: Feature dashboards are currently produced using Callum's reproduction of the Anthropic feature dashboards. Plausibly these can be improved over time to provide richer insights into features. Automatic explanations of features are provided using OpenAI's method (a simplified sketch follows this item). We'll explore adding SAE benchmarks, which seem like an important part of quality control for SAEs.
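To give a sense of how OpenAI-style automatic interpretability works, the sketch below shows an explainer model a feature's top-activating examples and asks for a short explanation. The example data is fabricated for illustration, and the prompt format is a loose simplification of the published method, not Neuronpedia's exact pipeline.

```python
# Simplified auto-interp sketch: ask an explainer model to describe a
# feature from its top-activating examples. Prompt format and example
# data are illustrative, not the exact published pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical top-activating snippets for one feature; in practice
# these come from running the SAE over a large activation dataset.
examples = [
    ("the Golden Gate Bridge at sunset", 9.1),
    ("crossing the bridge at dawn", 7.4),
    ("a suspension bridge in fog", 6.8),
]
prompt = "In a short phrase, explain what this feature responds to.\n" + "\n".join(
    f"activation {act:.1f}: {text}" for text, act in examples
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```

The published method also scores explanations by having a model simulate activations and comparing them to the real ones; that scoring step is omitted here.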
Work with researchers to add functionality that accelerates research: Given the scale of SAEs, there are likely lots of ways we can make it easier to conduct or share research with the platform. This might involve things like developing more functionality for circuit style analysis or supporting nov...