July 16, 2024

“I found >800 orthogonal ‘write code’ steering vectors” by Jacob G-W, TurnTrout

13 minutes

This is a link post.

Produced as part of the MATS Summer 2024 program, under the mentorship of Alex Turner (TurnTrout).

A few weeks ago, I stumbled across a very weird fact: it is possible to find multiple steering vectors in a language model that activate very similar behaviors while all being mutually orthogonal. This was pretty surprising to me and to some people I talked to, so I decided to write a post about it. I don't currently have the bandwidth to investigate this much further, so I'm just putting this post and the code up.
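To make "orthogonal" concrete: a set of vectors is orthogonal when every pair has a dot product (and so a cosine similarity) of zero. The sketch below is only an illustration of that property, not the method used in the post. It assumes random stand-in vectors in a hypothetical 64-dimensional space rather than real steering vectors, and it builds the set with plain Gram-Schmidt projection; the helper names (`project_out`, `orthogonal_set`, `max_abs_cosine`) are made up for this example.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; 0 means the two vectors are orthogonal."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def project_out(v, basis):
    """Remove from v its component along each unit vector in basis."""
    for b in basis:
        coeff = dot(v, b)
        v = [x - coeff * y for x, y in zip(v, b)]
    return v


def orthogonal_set(dim, count, seed=0):
    """Build `count` unit vectors in `dim` dimensions, each orthogonal
    to all earlier ones (Gram-Schmidt on random Gaussian draws)."""
    rng = random.Random(seed)
    basis = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = project_out(v, basis)
        norm = math.sqrt(dot(v, v))
        basis.append([x / norm for x in v])
    return basis


def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all distinct pairs."""
    return max(
        abs(cosine(vectors[i], vectors[j]))
        for i in range(len(vectors))
        for j in range(i)
    )


# Stand-in vectors only; dim=64 and count=8 are arbitrary choices.
vectors = orthogonal_set(dim=64, count=8)
print(max_abs_cosine(vectors))  # should be zero up to floating-point error
```

A check like `max_abs_cosine` is the kind of test one could run on any collection of candidate vectors to confirm they are pairwise orthogonal up to numerical error.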

I'll first discuss how I found these orthogonal steering vectors, then share some results. Finally, I'll discuss some possible explanations for what is happening.

Methodology

My work here builds upon Mechanistically Eliciting Latent Behaviors in Language Models (MELBO). I use MELBO to find steering vectors. Once I have a MELBO vector, I then [...]

---

Outline:

(00:50) Methodology

(03:30) Results

(08:46) Hypotheses for what is happening

(11:33) Conclusion

(12:11) Appendix: other orthogonal vectors

(12:22) “becomes an alien species” vector results

(12:43) STEM problem vector results

(13:07) jailbreak vector results

The original text contained 9 images which were described by AI.

---

First published:

July 15th, 2024

Source:

https://www.lesswrong.com/posts/CbSEZSpjdpnvBcEvc/i-found-greater-than-800-orthogonal-write-code-steering

---

Narrated by TYPE III AUDIO.


By LessWrong

