June 21, 2024

“Interpreting and Steering Features in Images” by Gytis Daujotas

9 minutes

We trained a sparse autoencoder (SAE) to find sparse features in image embeddings, and found many meaningful, interpretable, and steerable features. Steering image diffusion with these features works surprisingly well and yields predictable, high-quality generations.

You can see the feature library here. We also have an intervention playground you can try.

Key Results

We can extract interpretable features from CLIP image embeddings.
We observe a diverse set of features, e.g. golden retrievers, there being two of something, image borders, nudity, and stylistic effects.
Editing features allows for conceptual and semantic changes while maintaining generation quality and coherency.
We devise a way to preview the causal impact of a feature, and show that many features have an explanation that is consistent with what they activate for and what they cause.
Many feature edits can be stacked to perform task-relevant operations, like transferring a subject, mixing in a specific property of [...]
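
To make the idea of editing a feature concrete, here is a minimal sketch of encoding an embedding into sparse features, changing one feature, and decoding back. Everything in it is an assumption for illustration: the dimensions, the randomly initialised weights standing in for a trained SAE, the ReLU encoder form, and the helper names `encode`, `decode`, and `steer` are hypothetical and are not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 16      # hypothetical; real CLIP image embeddings are much larger
NUM_FEATURES = 64   # hypothetical dictionary size

# Randomly initialised weights stand in for a trained SAE (assumption).
W_enc = rng.normal(scale=0.1, size=(EMBED_DIM, NUM_FEATURES))
b_enc = np.zeros(NUM_FEATURES)
W_dec = rng.normal(scale=0.1, size=(NUM_FEATURES, EMBED_DIM))
b_dec = np.zeros(EMBED_DIM)


def encode(x):
    """Map an embedding to non-negative feature activations (ReLU encoder)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Map feature activations back to embedding space."""
    return f @ W_dec + b_dec


def steer(x, feature_index, value):
    """Set one feature to `value` and return the edited embedding.

    The part of `x` that the SAE fails to reconstruct is added back,
    so only the chosen feature's contribution changes.
    """
    f = encode(x)
    residual = x - decode(f)
    f_edit = f.copy()
    f_edit[feature_index] = value
    return decode(f_edit) + residual


embedding = rng.normal(size=EMBED_DIM)   # placeholder for a CLIP image embedding
steered = steer(embedding, feature_index=3, value=5.0)
```

In this sketch the edited embedding differs from the original only along the decoder direction of the chosen feature. In a setup like the one described, the edited embedding would then condition an image diffusion model; that step is omitted here.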

---

Outline:

(00:30) Key Results

(01:20) Interactive demo

(01:33) Introduction

(02:22) Steering Features

(04:22) Discovering and Interpreting Features

(06:22) Autointerpretation Labels

(07:13) Training Details

(08:34) Future Work

---

First published:

June 20th, 2024

Source:

https://www.lesswrong.com/posts/Quqekpvx8BGMMcaem/interpreting-and-steering-features-in-images

---

Narrated by TYPE III AUDIO.


LessWrong (30+ Karma)

By LessWrong
