August 05, 2024

LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas

13 minutes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders, published by Gytis Daujotas on August 5, 2024 on LessWrong.

Click here to open a live research preview where you can try interventions using this SAE.

This is a follow-up to a previous post on finding interpretable and steerable features in CLIP.

Motivation

Modern image diffusion models often use CLIP in order to condition generation. Put simply, users use CLIP to embed prompts or images, and these embeddings are used to diffuse another image back out.

Despite this, image models have severe user interface limitations. We already know that CLIP has a rich inner world model, but it's often surprisingly hard to make precise tweaks or reference specific concepts just by prompting alone. Similar prompts often yield a different image, or when we have a specific idea in mind, it can be too hard to find the right string of words to elicit the right concepts we need.

If we're able to understand the internal representation that CLIP uses to encode information about images, we might be able to get more expressive tools and mechanisms to guide generation and steer it without using any prompting. In the ideal world, this would enable the ability to make fine adjustments or even reference particular aspects of style or content without needing to specify what we want in language.

We could instead leverage CLIP's internal understanding to pick and choose what concepts to include, like a palette or a digital synthesizer.

It would also enable us to learn something about how image models represent the world, and how humans can interact with and use this representation, thereby skipping the text encoder and manipulating the model's internal state directly.

Introduction

CLIP is a neural network commonly used to guide image diffusion. A Sparse Autoencoder was trained on the dense image embeddings CLIP produces to transform it into a sparse representation of active features. These features seem to represent individual units of meaning. They can also be manipulated in groups - combinations of multiple active features - that represent intuitive concepts.

These groups can be understood entirely visually, and often encode surprisingly rich and interesting conceptual detail.

By directly manipulating these groups as single units, image generation can be edited and guided without using prompting or language input. Concepts that were difficult to specify or edit by text prompting become easy and intuitive to manipulate in this new visual representation.

Since many models use the same CLIP joint representation space that this work analyzed, this technique works to control many popular image models out of the box.

Summary of Results

Any arbitrary image can be decomposed into its constituent concepts. Many concepts (groups of features) that we find seem to slice images up into a fairly natural ontology of their human interpretable components. We find grouping them together is an effective approach to yield a more interpretable and useful grain of control.

These concepts can be used like knobs to steer generation in leading models like Stable Cascade. Many concepts have an obvious visual meaning yet are hard to precisely label in language, which suggests that studying CLIP's internal representations can be used as a lens into the variety of the visual domain. Tweaking the activations of these concepts can be used to expressively steer and guide generation in multiple image diffusion models that we tried.

We released the weights and a live demo of controlling image generation in feature space. By analyzing a SAE trained on CLIP, we get a much more vivid picture of the rich understanding that CLIP learns. We hope this is just the beginning of more effective and useful interventions in the internal representations of n...

...more

View all episodes

By The Nonlinear Fund

4.6

88 ratings

August 05, 2024

LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas

13 minutes

Click here to open a live research preview where you can try interventions using this SAE.

This is a follow-up to a previous post on finding interpretable and steerable features in CLIP.

Motivation

Modern image diffusion models often use CLIP in order to condition generation. Put simply, users use CLIP to embed prompts or images, and these embeddings are used to diffuse another image back out.

We could instead leverage CLIP's internal understanding to pick and choose what concepts to include, like a palette or a digital synthesizer.

Introduction

These groups can be understood entirely visually, and often encode surprisingly rich and interesting conceptual detail.

Since many models use the same CLIP joint representation space that this work analyzed, this technique works to control many popular image models out of the box.

Summary of Results

...more

Share LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas

Sign up to save your podcasts

LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas

LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas