November 08, 2024

“Analyzing how SAE features evolve across a forward pass” by bensenberner, danibalcells, Michael Oesterle, Ediz Ucar, StefanHex

2 minutes

This is a link post.

This research was completed for the Supervised Program for Alignment Research (SPAR) summer 2024 iteration. The team was supervised by @Stefan Heimersheim (Apollo Research). Find out more about the program and upcoming iterations here.

TL;DR: We look for related SAE features purely on the basis of statistical correlations. We consider this a cheap way to estimate, for example, how many new features appear in a layer and how many are passed through from previous layers (similar to the feature lifecycle in Anthropic's Crosscoders). We find communities of related features, as well as features that appear to be quasi-boolean combinations of previous features.
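As a rough illustration of the correlation idea, the sketch below compares SAE feature activations from two adjacent layers and flags later-layer features that closely track an earlier one. It is a minimal sketch under stated assumptions, not the authors' code: the use of Pearson correlation, the function names `cross_layer_correlations` and `passed_through`, the 0.9 threshold, and the synthetic activations are all hypothetical choices made for this example.

```python
import numpy as np


def cross_layer_correlations(acts_a, acts_b, eps=1e-8):
    """Pearson correlation between each layer-A feature and each layer-B feature.

    acts_a: array of shape (n_tokens, n_features_a)
    acts_b: array of shape (n_tokens, n_features_b)
    Returns an array of shape (n_features_a, n_features_b).
    """
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)
    cov = a.T @ b / acts_a.shape[0]
    # eps guards against division by zero for features that never fire.
    return cov / (np.outer(a.std(axis=0), b.std(axis=0)) + eps)


def passed_through(corr, threshold=0.9):
    """Flag layer-B features whose best layer-A match exceeds the threshold.

    The threshold is an assumed value for illustration only.
    """
    return corr.max(axis=0) >= threshold


# Synthetic, sparse activations standing in for real SAE outputs (assumption).
rng = np.random.default_rng(0)
n_tokens = 1000
acts_a = rng.random((n_tokens, 3)) * (rng.random((n_tokens, 3)) < 0.1)

# Layer-B feature 0 is a rescaled copy of layer-A feature 0 ("passed through");
# layer-B feature 1 is drawn independently (a candidate "new" feature).
new_feature = rng.random(n_tokens) * (rng.random(n_tokens) < 0.1)
acts_b = np.stack([2.0 * acts_a[:, 0], new_feature], axis=1)

corr = cross_layer_correlations(acts_a, acts_b)
flags = passed_through(corr)
print(flags)
```

Under these assumptions, the count of flagged features gives a cheap estimate of how many features carry over from the previous layer, and the remainder are candidates for new features.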

Here's a web interface showcasing our feature graphs.

Communities of sparse features through a forward pass. Nodes represent residual stream SAE features that were active in the residual stream for a specific prompt of text. The rows of the graph correspond to layers in GPT-2 (the bottom [...]

---

First published:

November 7th, 2024

Source:

https://www.lesswrong.com/posts/5DauDzGC8KdvDRwSd/analyzing-how-sae-features-evolve-across-a-forward-pass

---

Narrated by TYPE III AUDIO.

