Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems, published by Sonia Joseph on March 13, 2024 on The AI Alignment Forum.
Join our Discord here.
This article was written by Sonia Joseph, in collaboration with Neel Nanda, and incubated in Blake Richards's lab at Mila and in the MATS community. Thank you to the Prisma core contributors, including Praneet Suresh, Rob Graham, and Yash Vadi.
Full acknowledgements of contributors are at the end. I am grateful to my collaborators for their guidance and feedback.
Outline
Part One: Introduction and Motivation
Part Two: Tutorial Notebooks
Part Three: Brief ViT Overview
Part Four: Demo of Prisma's Functionality
Key features, including logit attribution, attention head visualization, and activation patching.
Preliminary research results obtained using Prisma, including emergent segmentation maps and canonical attention heads.
Part Five: FAQ, including Key Differences between Vision and Language Mechanistic Interpretability
Part Six: Getting Started with Vision Mechanistic Interpretability
Part Seven: How to Get Involved
Part Eight: Open Problems in Vision Mechanistic Interpretability
Introducing the Prisma Library for Multimodal Mechanistic Interpretability
I am excited to share with the mechanistic interpretability and alignment communities a project I've been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterpart, CLIP.
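As a minimal sketch of what working with the library looks like, here is how you might load a hooked ViT and cache its activations, assuming Prisma mirrors the TransformerLens interface. The import path, the `HookedViT` class name, the checkpoint identifier, and the hook names below are assumptions for illustration and may differ from the released API.

```python
import torch
# Assumption: Prisma exposes a TransformerLens-style HookedViT; the import path
# and checkpoint name here are illustrative, not the confirmed API.
from vit_prisma.models.base_vit import HookedViT

model = HookedViT.from_pretrained("vit_base_patch16_224")  # hypothetical checkpoint id
model.eval()

# Run a (dummy) image batch and cache every intermediate activation,
# mirroring TransformerLens's run_with_cache.
images = torch.randn(1, 3, 224, 224)
logits, cache = model.run_with_cache(images)

print(logits.shape)                                # (batch, num_classes)
print(cache["blocks.0.attn.hook_pattern"].shape)   # (batch, heads, tokens, tokens)
```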
With the recent rapid release of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts keep pace. While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind.
Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non-language modalities is crucial for safety and alignment.
The goal of Prisma is to make research in mechanistic interpretability for multimodal models both easy and fun. We are also building a strong and collaborative open source research community around Prisma.
You can join our Discord here.
This post gives a brief overview of the library, fleshes out some concrete open problems, and provides steps for getting started.
Prisma Goals
Build shared infrastructure (Prisma) to make it easy to run standard language mechanistic interpretability techniques on non-language modalities, starting with vision.
Build a shared conceptual foundation for multimodal mechanistic interpretability.
Shape and execute on a research agenda for multimodal mechanistic interpretability.
Build an amazing multimodal mechanistic interpretability subcommunity, inspired by current efforts in language.
Set the cultural norms of this subcommunity to be highly collaborative, curious, inventive, friendly, respectful, prolific, and safety/alignment-conscious.
Encourage sharing of early/scrappy research results on Discord/LessWrong.
Co-create a web of high-quality research.
Tutorial Notebooks
To get started, you can check out three tutorial notebooks that show how Prisma works.
Main ViT Demo
Overview of the main mechanistic interpretability techniques on a ViT, including direct logit attribution, attention head visualization, and activation patching. The activation patching example switches the network's prediction from tabby cat to Border Collie with a minimal intervention.
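For a feel of what the patching code involves, here is a hedged sketch of activation patching in the TransformerLens style, reusing `model` from the loading sketch above: activations are cached from a run on a source image (e.g., a Border Collie) and written into the corresponding position of a run on a target image (e.g., a tabby cat). The hook name, patch index, and image tensors are placeholders, not the notebook's exact code.

```python
import torch

# Placeholder inputs; in the notebook these would be real preprocessed images.
source_image = torch.randn(1, 3, 224, 224)  # e.g., a Border Collie
target_image = torch.randn(1, 3, 224, 224)  # e.g., a tabby cat

# Cache activations from the source (clean) run.
_, source_cache = model.run_with_cache(source_image)

hook_name = "blocks.6.hook_resid_post"  # assumed hook naming convention
patch_idx = 50                          # which image-patch token to overwrite

def patch_residual(activation, hook):
    # Overwrite one token position of the residual stream with the cached
    # activation from the source run, leaving everything else untouched.
    activation[:, patch_idx, :] = source_cache[hook.name][:, patch_idx, :]
    return activation

patched_logits = model.run_with_hooks(
    target_image,
    fwd_hooks=[(hook_name, patch_residual)],
)
print(patched_logits.argmax(-1))  # does the prediction flip toward the source class?
```

Sweeping this over layers and patch positions is the standard way to localize which activations carry the class information.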
Emoji Logit Lens
Deeper dive into layer- and patch-level predictions with interactive plots.
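Conceptually, the logit lens decodes the residual stream at every layer with the model's final classification head, giving a per-layer, per-patch guess at the class. The sketch below reuses `model` and `cache` from the loading sketch; the hook names and the `ln_final`, `head`, and `cfg.n_layers` attributes are assumptions, and the released API may differ.

```python
import torch

def logit_lens(model, cache, layer):
    # Decode the residual stream after `layer` with the final layer norm
    # and classification head, as if the network ended there.
    resid = cache[f"blocks.{layer}.hook_resid_post"]       # (batch, tokens, d_model)
    resid = model.ln_final(resid)
    return resid @ model.head.weight.T + model.head.bias   # (batch, tokens, num_classes)

for layer in range(model.cfg.n_layers):
    layer_logits = logit_lens(model, cache, layer)
    # Top-1 class per token at this layer; token 0 is typically the CLS token.
    print(layer, layer_logits.argmax(-1)[0, :5].tolist())
```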
Interactive Attention Head Tour
Deeper dive into the various types of attention heads a ViT contains, with interactive JavaScript visualizations.
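As a rough illustration of what the notebook visualizes, the sketch below pulls one head's attention pattern out of the cache from the loading sketch and displays the CLS token's attention over the image patches. The hook name and the 14x14 patch grid (a 224px image with 16px patches) are assumptions about the model.

```python
import matplotlib.pyplot as plt

layer, head = 3, 5
# (tokens, tokens) attention pattern for one head; hook name is an assumption.
pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]

# Attention paid by the CLS token (index 0) to each of the 196 image patches,
# reshaped onto the 14x14 patch grid of a 224px / 16px-patch ViT.
cls_to_patches = pattern[0, 1:].reshape(14, 14).detach().cpu()

plt.imshow(cls_to_patches)
plt.title(f"Layer {layer}, head {head}: CLS attention over image patches")
plt.colorbar()
plt.show()
```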
Brief ViT Overview
A vision transformer...