LessWrong (30+ Karma)

“Jailbreak steering generalization” by Nina Rimsky, Sarah Ball



This is a link post.

This work was performed as part of SPAR.

We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may.

Our analysis includes a wide range of jailbreaks, such as the harmful prompts developed in Wei et al. (2024), the universal jailbreak in Zou et al. (2023b), and the payload split jailbreak in Kang et al. (2023). For all our experiments, we use the Vicuna 13B v1.5 model.

As a first step, we produce jailbreak vectors for each jailbreak type by contrasting the internal activations of jailbreak and non-jailbreak versions of the same request (Rimsky et al., 2023; Zou et al., 2023a).
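To make the contrast step concrete, here is a minimal sketch of one common way to build such a vector: the mean difference between paired activations. It is an illustration under stated assumptions, not the authors' code. The function names (`mean_difference_vector`, `steer`), the coefficient `coeff`, and the toy random arrays are all hypothetical; in a real experiment the arrays would hold activations read from a chosen layer and token position of the model, details this excerpt does not specify.

```python
import numpy as np


def mean_difference_vector(jailbreak_acts, plain_acts):
    """Average of (jailbreak - plain) over paired prompts.

    Both inputs have shape (n_pairs, d_model); row i of each array is assumed
    to come from the jailbreak and non-jailbreak version of the same request.
    """
    jailbreak_acts = np.asarray(jailbreak_acts, dtype=float)
    plain_acts = np.asarray(plain_acts, dtype=float)
    if jailbreak_acts.shape != plain_acts.shape:
        raise ValueError("paired activations must have the same shape")
    return (jailbreak_acts - plain_acts).mean(axis=0)


def steer(hidden, vector, coeff):
    """Add a scaled steering vector to activations (hypothetical helper).

    A negative coeff moves activations away from the jailbreak direction.
    """
    return np.asarray(hidden, dtype=float) + coeff * vector


# Toy data standing in for real model activations (assumption).
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0

plain = rng.normal(size=(16, d_model))
jailbreak = plain + 2.0 * direction  # jailbreak prompts shifted along one axis

vec = mean_difference_vector(jailbreak, plain)
steered = steer(jailbreak, vec, coeff=-1.0)
```

In this toy setup the recovered vector points along the planted direction, and subtracting it maps the jailbreak activations back onto the plain ones. Real activations are noisier, so the choice of layer and coefficient would need to be tuned empirically.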

Examples of prompt pairs used for jailbreak vector generation

Interestingly, we find that steering with mean-difference jailbreak vectors from one cluster of jailbreaks helps to [...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:

June 20th, 2024

Source:

https://www.lesswrong.com/posts/pZmFcKWZ4dXJaPPPa/jailbreak-steering-generalization

---

Narrated by TYPE III AUDIO.
