
Sign up to save your podcasts
Or


A few days ago, Gray Swan published code and models for their recent “circuit breakers” method for language models.[1]1
The circuit breakers method defends against jailbreaks by training the model to erase “bad” internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability concepts or tools.
At the link, we briefly investigate three topics:
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrongA few days ago, Gray Swan published code and models for their recent “circuit breakers” method for language models.[1]1
The circuit breakers method defends against jailbreaks by training the model to erase “bad” internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability concepts or tools.
At the link, we briefly investigate three topics:
---
First published:
Source:
Narrated by TYPE III AUDIO.

112,842 Listeners

130 Listeners

7,215 Listeners

531 Listeners

16,221 Listeners

4 Listeners

14 Listeners

2 Listeners