
A few days ago, Gray Swan published code and models for their recent “circuit breakers” method for language models.[1]
The circuit breakers method defends against jailbreaks by training the model to erase “bad” internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability concepts or tools.
At the link, we briefly investigate three topics:
---
First published:
Source:
Narrated by TYPE III AUDIO.