
This research was completed for the Supervised Program for Alignment Research (SPAR) summer 2024 iteration. The team was supervised by @Stefan Heimersheim (Apollo Research). Find out more about the program and upcoming iterations here.
TL;DR: We look for related SAE features purely based on statistical correlations. We consider this a cheap method to estimate, for example, how many new features there are in a layer and how many features are passed through from previous layers (similar to the feature lifecycle in Anthropic's Crosscoders). We find communities of related features, and features that appear to be quasi-boolean combinations of previous features.
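To make the idea concrete, here is a minimal sketch of how correlation-based matching between two adjacent layers could look. It is not the authors' code: the use of Pearson correlation, the 0.9 threshold, the function names, and the synthetic activations are all assumptions chosen for illustration. A next-layer feature whose best correlation with any previous-layer feature exceeds the threshold is counted as "passed through"; the rest are counted as "new".

```python
import numpy as np

def cross_layer_correlation(acts_prev, acts_next, eps=1e-8):
    """Pearson correlation between every pair of features across two layers.

    acts_prev: (n_tokens, n_prev_features) SAE activations at layer l
    acts_next: (n_tokens, n_next_features) SAE activations at layer l+1
    Returns an (n_prev_features, n_next_features) correlation matrix.
    """
    a = acts_prev - acts_prev.mean(axis=0)
    b = acts_next - acts_next.mean(axis=0)
    a = a / (np.linalg.norm(a, axis=0) + eps)
    b = b / (np.linalg.norm(b, axis=0) + eps)
    return a.T @ b

def split_new_and_passed_through(corr, threshold=0.9):
    """Label each next-layer feature by its best match in the previous layer.

    The threshold is an illustrative assumption, not a value from the post.
    """
    best = corr.max(axis=0)
    passed = np.flatnonzero(best >= threshold)
    new = np.flatnonzero(best < threshold)
    return passed, new

# Synthetic sparse activations standing in for real SAE features.
rng = np.random.default_rng(0)
n_tokens = 2000
prev = rng.random((n_tokens, 3)) * (rng.random((n_tokens, 3)) < 0.1)
# Next layer: feature 0 is a rescaled copy of previous feature 1,
# feature 1 is independent of everything in the previous layer.
independent = rng.random(n_tokens) * (rng.random(n_tokens) < 0.1)
next_ = np.stack([2.0 * prev[:, 1], independent], axis=1)

corr = cross_layer_correlation(prev, next_)
passed, new = split_new_and_passed_through(corr)
```

On real data the activation matrices would come from running SAEs over many tokens, and the choice of similarity measure and threshold would need to be checked against the resulting feature graphs.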
Here's a web interface showcasing our feature graphs.
Communities of sparse features through a forward pass. Nodes represent residual stream SAE features that were active in the residual stream for a specific prompt of text. The rows of the graph correspond to layers in GPT-2 (the bottom [...]
---
First published:
Source:
Narrated by TYPE III AUDIO.