
Sign up to save your podcasts
Or


This work was performed as part of SPAR
We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may.
Our analysis includes a wide range of jailbreaks such as harmful prompts developed in Wei et al. 2024, the universal jailbreak in Zou et al. (2023b), and the payload split jailbreak in Kang et al. (2023). For all our experiments we use the Vicuna 13B v1.5 model.
In a first step, we produce jailbreak vectors for each jailbreak type by contrasting the internal activations of jailbreak and non-jailbreak versions of the same request (Rimsky et al., 2023; Zou et al., 2023a).
Examples of prompt pairs used for jailbreak vector generationInterestingly, we find that steering with mean-difference jailbreak vectors from one cluster of jailbreaks helps to [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrongThis work was performed as part of SPAR
We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may.
Our analysis includes a wide range of jailbreaks such as harmful prompts developed in Wei et al. 2024, the universal jailbreak in Zou et al. (2023b), and the payload split jailbreak in Kang et al. (2023). For all our experiments we use the Vicuna 13B v1.5 model.
In a first step, we produce jailbreak vectors for each jailbreak type by contrasting the internal activations of jailbreak and non-jailbreak versions of the same request (Rimsky et al., 2023; Zou et al., 2023a).
Examples of prompt pairs used for jailbreak vector generationInterestingly, we find that steering with mean-difference jailbreak vectors from one cluster of jailbreaks helps to [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.

112,882 Listeners

130 Listeners

7,216 Listeners

533 Listeners

16,223 Listeners

4 Listeners

14 Listeners

2 Listeners