Audio note: this article contains 449 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.
Summary: I consider deep causal transcoders (DCTs) with various activation functions [...]
---
Outline:
(05:40) Introduction
(07:16) Related work
(09:59) Theory
(18:28) Method
(22:04) Fitting a Linear MLP
(25:08) Fitting a Quadratic MLP
(30:12) Alternative formulation of tensor decomposition objective: causal importance minus similarity penalty
(34:03) Fitting an Exponential MLP
(36:35) On the role of _R_
(37:37) Relation to original MELBO objective
(38:54) Calibrating _R_
(41:44) Case Study: Learning Jailbreak Vectors
(41:49) Generalization of linear, quadratic and exponential DCTs
(51:49) Evidence for multiple harmless directions
(56:00) Many loosely-correlated DCT features elicit jailbreaks
(59:44) Averaging doesn't improve generalization when we add features to the residual stream
(01:01:05) Averaging does improve jailbreak scores when we ablate features
(01:03:15) Ablating (averaged) target-layer features also works
(01:04:49) Deeper models: constant depth horizon (_t - s_) suffices for learning jailbreaks
(01:09:32) Application: Jailbreaking Representation-Rerouted Mistral-7B
(01:17:15) Application: Eliciting Capabilities in Password-Locked Models
(01:18:42) Future Work
(01:19:01) Studying feature multiplicity
(01:20:03) Quantifying a broader range of behaviors
(01:20:24) Effect of pre-training hyper-parameters
(01:22:05) Acknowledgements
(01:22:16) Appendix
(01:22:19) Hessian autodiff details
The original text contained 26 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---