
TL;DR
---
Outline:
(00:05) TL;DR
(01:39) Introduction
(03:49) How to use
(03:53) Basic method
(04:48) Metrics for choosing scale and evaluating
(05:58) Self-Similarity
(07:44) Entropy
(09:33) Composite
(10:36) Other experiments
(11:18) Evaluation
(15:55) Limitations and improvements
(15:59) Recovering the activating token
(18:42) Failure detection
(19:04) Maximum Self-Similarity Thresholding
(19:44) Repeat Prompt Failure Detection
(20:08) Layer-Specific Thresholds
(20:57) Prior work
(21:55) Examples
(21:58) Gemma 2B
(22:02) Random simple features
(23:02) More complex features
(25:05) Phi-3 Mini
(25:25) Random features
(26:18) Refusal features
(26:48) Acknowledgements
(27:03) Appendix
(27:06) Entropy justification
The original text contained 1 image which was described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---