
TL;DR
---
Outline:
(00:05) TL;DR
(01:39) Introduction
(03:49) How to use
(03:53) Basic method
(04:48) Metrics for choosing scale and evaluating
(05:58) Self-Similarity
(07:44) Entropy
(09:33) Composite
(10:36) Other experiments
(11:18) Evaluation
(15:55) Limitations and improvements
(15:59) Recovering the activating token
(18:42) Failure detection
(19:04) Maximum Self-Similarity Thresholding
(19:44) Repeat Prompt Failure Detection
(20:08) Layer-Specific Thresholds
(20:57) Prior work
(21:55) Examples
(21:58) Gemma 2B
(22:02) Random simple features
(23:02) More complex features
(25:05) Phi-3 Mini
(25:25) Random features
(26:18) Refusal features
(26:48) Acknowledgements
(27:03) Appendix
(27:06) Entropy justification
The original text contained 1 image which was described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.