July 30, 2024

“Understanding Positional Features in Layer 0 SAEs” by bilalchughtai, Yeu-Tong Lau

13 minutes

This is an informal research note. It is the result of a few-day exploration into positional SAE features, conducted as part of Neel Nanda's training phase of the ML Alignment & Theory Scholars Program, Summer 2024 cohort.

Thanks to Andy Arditi, Arthur Conmy and Stefan Heimersheim for helpful feedback. Thanks to Joseph Bloom for training this SAE.

Summary

Figure 1: (Dots) The top 3 PCA components of rows 1 to 127 of gpt2-small's positional embedding matrix explain 95% of their variance. (Crosses) SAEs trained on layer 0 residual stream activations learn many features that together recover this 1-dimensional helical manifold. Colour corresponds to the position on which the feature is most active: blue corresponds to position 1, red to position 127. The position 0 row and SAE features are omitted (as they are weird).
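
The explained-variance figure quoted in the caption could be checked along the following lines. This is a minimal sketch, not the authors' code: the helper `top_k_explained_variance` is a hypothetical name, and the matrix `fake_pos_embed` is a synthetic helix standing in for gpt2-small's positional embeddings, so the number it produces says nothing about the real model. To reproduce the caption's claim, one would pass rows 1 to 127 of the actual positional embedding matrix instead.

```python
import numpy as np


def top_k_explained_variance(rows: np.ndarray, k: int = 3) -> float:
    """Fraction of total variance captured by the top-k principal components."""
    # PCA operates on mean-centred data.
    centered = rows - rows.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance along each
    # principal component, sorted from largest to smallest.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[:k].sum() / variances.sum())


# Synthetic stand-in (an assumption, NOT gpt2-small's weights):
# a slightly noisy 1-dimensional helix embedded in a 768-dimensional space.
rng = np.random.default_rng(0)
positions = np.arange(1, 128)  # positions 1..127; position 0 omitted, as in the figure
t = 2 * np.pi * positions / 127
helix_3d = np.stack([np.cos(t), np.sin(t), t / (2 * np.pi)], axis=1)  # shape (127, 3)

# Random orthonormal basis for a 3-dimensional subspace of the 768-dim space.
basis, _ = np.linalg.qr(rng.normal(size=(768, 3)))  # shape (768, 3)
fake_pos_embed = helix_3d @ basis.T + 0.001 * rng.normal(size=(127, 768))

ratio = top_k_explained_variance(fake_pos_embed, k=3)
print(f"Top-3 explained variance on synthetic helix: {ratio:.4f}")
```

Because the synthetic data is three-dimensional by construction apart from small noise, the ratio here comes out close to 1. The interesting part of the caption is that the real positional embeddings, which are not constructed this way, still concentrate most of their variance in three components.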

We investigate positional SAE features learned by layer 0 residual stream SAEs trained on gpt2-small. In [...]

The original text contained 8 images which were described by AI.

---

First published:

July 29th, 2024

Source:

https://www.lesswrong.com/posts/ctGeJGHg9pbc8memF/understanding-positional-features-in-layer-0-saes

---

Narrated by TYPE III AUDIO.
