April 14, 2025

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky

21 minutes

TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output.

Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.

Intro

Somewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment".

My own recent research focus has been on directly optimizing steering vectors on a single input and seeing if they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized [...]

---

Outline:

(00:31) Intro

(01:22) Why care?

(03:01) How we optimized our steering vectors

(05:01) Evaluation method

(06:05) Results

(06:09) Alignment scores of steered generations

(07:59) Resistance is futile: counting misaligned strings

(09:29) Is there a single, simple, easily-locatable representation of misalignment? Some preliminary thoughts

(13:29) Does increasing steering strength increase misalignment?

(15:41) Why do harmful code vectors induce more general misalignment? A hypothesis

(17:24) What have we learned, and where do we go from here?

(19:49) Appendix: how do we obtain our harmful code steering vectors?

The original text contained 1 footnote which was omitted from this narration.

---

First published:

April 14th, 2025

Source:

https://www.lesswrong.com/posts/kcKnKHTHycHeRhcHF/one-shot-steering-vectors-cause-emergent-misalignment-too

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

View all episodes

By LessWrong

April 14, 2025

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky

21 minutes

Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.

Intro

---

Outline:

(00:31) Intro

(01:22) Why care?

(03:01) How we optimized our steering vectors

(05:01) Evaluation method

(06:05) Results

(06:09) Alignment scores of steered generations

(07:59) Resistance is futile: counting misaligned strings

(09:29) Is there a single, simple, easily-locatable representation of misalignment? Some preliminary thoughts

(13:29) Does increasing steering strength increase misalignment?

(15:41) Why do harmful code vectors induce more general misalignment? A hypothesis

(17:24) What have we learned, and where do we go from here?

(19:49) Appendix: how do we obtain our harmful code steering vectors?

The original text contained 1 footnote which was omitted from this narration.

---

First published:

April 14th, 2025

Source:

https://www.lesswrong.com/posts/kcKnKHTHycHeRhcHF/one-shot-steering-vectors-cause-emergent-misalignment-too

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

More shows like LessWrong (30+ Karma)

View all

Making Sense with Sam Harris

26,336 Listeners

Conversations with Tyler

2,384 Listeners

The Peter Attia Drive

8,004 Listeners

Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas

4,125 Listeners

ManifoldOne

90 Listeners

Your Undivided Attention

1,494 Listeners

All-In with Chamath, Jason, Sacks & Friedberg

9,255 Listeners

Machine Learning Street Talk (MLST)

91 Listeners

Dwarkesh Podcast

424 Listeners

Hard Fork

5,448 Listeners

The Ezra Klein Show

15,481 Listeners

Moonshots with Peter Diamandis

505 Listeners

No Priors: Artificial Intelligence | Technology | Startups

127 Listeners

Latent Space: The AI Engineer Podcast

71 Listeners

BG2Pod with Brad Gerstner and Bill Gurley

465 Listeners

Share “One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky

Sign up to save your podcasts

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky

More shows like LessWrong (30+ Karma)

Making Sense with Sam Harris

Conversations with Tyler

The Peter Attia Drive

Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas

ManifoldOne

Your Undivided Attention

All-In with Chamath, Jason, Sacks & Friedberg

Machine Learning Street Talk (MLST)

Dwarkesh Podcast

Hard Fork

The Ezra Klein Show

Moonshots with Peter Diamandis

No Priors: Artificial Intelligence | Technology | Startups

Latent Space: The AI Engineer Podcast

BG2Pod with Brad Gerstner and Bill Gurley