The Nonlinear Library: Alignment Forum

AF - Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders by Johnny Lin




Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders, published by Johnny Lin on March 25, 2024 on The AI Alignment Forum.
This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers.
TL;DR
Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we've pivoted to accelerating Sparse Autoencoder (SAE) research by hosting models, feature dashboards, data visualizations, tooling, and more.
Important Links
Explore: The SAE-research-focused Neuronpedia. Current SAEs for GPT2-Small:
RES-JB: Residuals - Joseph Bloom (294k features)
ATT-KK: Attention Out - Connor Kissane + Robert Kryzanowski (344k features)
Upload: Get your SAEs hosted by Neuronpedia: fill out this <10-minute application
Participate: Join #neuronpedia on the Open Source Mech Interp Slack
Neuronpedia has received 1 year of funding. Johnny Lin is working full-time on engineering, design, and product, while Joseph Bloom is supporting with high-level direction and product management. We'd love to talk to you about how Neuronpedia can speed up your SAE research. Fill out this short form to get in touch.
Introduction
Why Neuronpedia?
While SAEs are exciting, they introduce a number of engineering challenges which can slow or even prevent research from being done. These challenges arise because SAEs decompose neural networks into numerous distinct features which we are trying to understand.
Even when working with relatively small language models like GPT2-Small, there are challenges in producing and storing feature dashboards, automatically interpreting features, and organizing the insights we collect about features.
As the community produces more SAEs, trains SAEs on larger models, and develops better techniques for understanding and analyzing these artifacts, there will be an ever-increasing number of engineering challenges in the way of research into model internals.
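To make the decomposition concrete, here is a minimal sketch of a sparse autoencoder in the style described above. The dimensions, class name, and expansion factor are illustrative assumptions, not any particular lab's implementation.

```python
# Minimal SAE sketch: decompose model activations into sparse features
# and reconstruct them. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 24576):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> features
        self.decoder = nn.Linear(d_features, d_model)   # features -> reconstruction

    def forward(self, activations: torch.Tensor):
        # ReLU zeroes out non-firing features, which is where the
        # sparsity in "sparse autoencoder" comes from (together with an
        # L1 penalty on the features during training, omitted here).
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(1, 768)  # one residual-stream activation vector
features, reconstruction = sae(acts)
print((features > 0).sum().item(), "active features out of", features.shape[-1])
```

Each of those tens of thousands of features is a candidate unit of analysis, which is what drives the dashboard, storage, and interpretation challenges above.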
What It's Already Useful For
Generating and hosting more than 600k SAE feature dashboards. This includes GPT2-small residual stream SAEs for Joseph Bloom (RES-JB) and attention out SAEs for Connor Kissane and Robert Kryzanowski (ATT-KK).
Running Python servers to enable prompt-based feature search and feature-specific prompt testing. This makes it easier for researchers and the community to qualitatively test features, write up results, and share SAEs (a minimal sketch of feature testing follows this list).
Generating automatic explanations of features with GPT-3.5-Turbo, which can be navigated via UMAPs. The automatic explanations are not currently amazing, but hopefully we can iterate and improve on them.
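As a rough illustration of the prompt-testing workflow mentioned above, the sketch below runs a prompt through GPT2-small with TransformerLens and lists the strongest-firing SAE features at the last token. The encoder here is a random, untrained stand-in for a real trained SAE, and the layer choice is arbitrary.

```python
# Hedged sketch of prompt-based feature testing. The encoder is a
# random stand-in for a trained SAE; layer 8 is an arbitrary choice.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The cat sat on the mat")
resid = cache["blocks.8.hook_resid_pre"]  # shape: [batch, seq, d_model]

encoder = torch.nn.Linear(model.cfg.d_model, 24576)  # stand-in SAE encoder
features = torch.relu(encoder(resid))

# Top five features at the final token position.
top_vals, top_ids = features[0, -1].topk(5)
for fid, val in zip(top_ids.tolist(), top_vals.tolist()):
    print(f"feature {fid}: activation {val:.2f}")
```

With a real trained SAE, the feature IDs printed here are exactly the ones whose dashboards you would then look up.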
Making It More Useful
Researchers / Labs publish SAEs that are uploaded to Neuronpedia: There are many different labs training SAEs, so it's a safe bet that there will be many SAEs in the public domain in the near future. We make it easy for researchers to host their SAEs and enable functionality like feature testing, UMAPs, and more.
Generate feature dashboards, automatic interpretability, and benchmarks: Feature dashboards are currently produced using Callum's reproduction of the Anthropic feature dashboards. Plausibly these can be improved over time to provide richer insights into features. Automatic explanations of features are provided using OpenAI's method (a simplified sketch follows this item). We'll explore adding SAE benchmarks, which seem like an important part of quality control for SAEs.
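To give a sense of how OpenAI-style automatic interpretability works, the sketch below shows an explainer model a feature's top-activating examples and asks for a short explanation. The example data is fabricated for illustration, and the prompt format is a loose simplification of the published method, not Neuronpedia's exact pipeline.

```python
# Simplified auto-interp sketch: ask an explainer model to describe a
# feature from its top-activating examples. Prompt format and example
# data are illustrative, not the exact published pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical top-activating snippets for one feature; in practice
# these come from running the SAE over a large activation dataset.
examples = [
    ("the Golden Gate Bridge at sunset", 9.1),
    ("crossing the bridge at dawn", 7.4),
    ("a suspension bridge in fog", 6.8),
]
prompt = "In a short phrase, explain what this feature responds to.\n" + "\n".join(
    f"activation {act:.1f}: {text}" for text, act in examples
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```

The published method also scores explanations by having a model simulate activations and comparing them to the real ones; that scoring step is omitted here.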
Work with researchers to add functionality that accelerates research: Given the scale of SAEs, there are likely lots of ways we can make it easier to conduct or share research with the platform. This might involve things like developing more functionality for circuit style analysis or supporting nov...