
Audio note: This article contains 149 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
ARC recently released our first empirical paper: Estimating the Probabilities of Rare Language Model Outputs. In this work, we construct a simple setting for low probability estimation — single-token argmax sampling in transformers — and use it to compare the performance of various estimation methods. ARC views low probability estimation as a potential technique for mitigating worst-case behavior from AI, including deceptive alignment; see our previous theory post on estimating tail risks in neural networks for further discussion.
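The problem setting is concrete enough to sketch. Given a distribution over inputs, the task is to estimate the probability that the model's argmax output equals some target token; naive Monte Carlo needs on the order of 1/p samples per hit, which is exactly what fails when p is tiny. Below is a minimal, hypothetical sketch in PyTorch: the random linear "model", the uniform input distribution, and the hand-picked proposal tilt are all illustrative assumptions (the paper uses small transformers and derives its proposals from gradients), but the contrast between naive counting and unbiased importance-sampling reweighting is the core idea.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a language model: a random linear map from a
# bag-of-tokens representation to vocabulary logits. Illustrative only;
# the paper's setting is single-token argmax sampling from transformers.
VOCAB, SEQ_LEN = 64, 8
W = torch.randn(VOCAB, VOCAB) / VOCAB**0.5

def logits(tokens: torch.Tensor) -> torch.Tensor:
    """Map a batch of token sequences (n, SEQ_LEN) to logits (n, VOCAB)."""
    counts = torch.nn.functional.one_hot(tokens, VOCAB).float().sum(dim=1)
    return counts @ W.T

TARGET = 3  # the output token whose argmax probability we want to estimate

def naive_estimate(n: int) -> float:
    """Naive Monte Carlo: sample inputs uniformly, count argmax hits.
    Needs on the order of 1/p samples per hit, so it returns zero in
    exactly the rare-event regime the paper cares about."""
    x = torch.randint(0, VOCAB, (n, SEQ_LEN))
    return (logits(x).argmax(dim=-1) == TARGET).float().mean().item()

def importance_estimate(n: int) -> float:
    """Importance sampling: draw each token from a proposal q tilted
    toward tokens that raise the target logit, then reweight hits by
    p(x)/q(x) so the estimate stays unbiased. The fixed tilt (and its
    temperature of 4.0) is a hand-picked stand-in for the gradient-derived
    proposals (ITGIS, MHIS) that the paper actually compares."""
    tilt = torch.softmax(4.0 * W[TARGET], dim=0)  # per-position proposal, (VOCAB,)
    x = torch.multinomial(tilt.expand(SEQ_LEN, VOCAB), n, replacement=True).T
    # Log importance weight: sum over positions of log(uniform) - log(proposal).
    log_w = (-torch.log(torch.tensor(float(VOCAB))) - torch.log(tilt[x])).sum(dim=1)
    hits = (logits(x).argmax(dim=-1) == TARGET).float()
    return (hits * log_w.exp()).mean().item()

print("naive:     ", naive_estimate(100_000))
print("importance:", importance_estimate(100_000))
```

In this toy the target probability is only around 1/VOCAB, so both estimators succeed; the paper's settings push p far lower, where the naive count is almost always zero while the reweighted estimator still concentrates samples on the event.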
---
Outline:
(01:32) Problem statement
(03:24) Estimation methods
(04:57) Importance sampling
(05:57) Independent Token Gradient Importance Sampling (ITGIS)
(07:16) Metropolis-Hastings Importance Sampling (MHIS)
(08:50) Activation extrapolation
(09:21) Quadratic Logit Decomposition (QLD)
(11:06) Gaussian Logit Difference
(11:41) Empirical results
(14:08) Discussion
(14:11) Motivation
(16:15) Importance sampling versus activation extrapolation
(17:15) Limitations
(18:54) Related work
(19:56) Future directions
The original text contained 7 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.