
Audio note: this article contains 149 instances of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
ARC recently released our first empirical paper: Estimating the Probabilities of Rare Language Model Outputs. In this work, we construct a simple setting for low probability estimation — single-token argmax sampling in transformers — and use it to compare the performance of various estimation methods. ARC views low probability estimation as a potential technique for mitigating worst-case behavior from AI, including deceptive alignment; see our previous theory post on estimating tail risks in neural networks for further discussion.
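To make the problem setting concrete, here is a minimal sketch of the baseline that dedicated estimation methods try to beat: naive Monte Carlo estimation of a rare single-token output probability. This is illustrative code, not from the paper; `naive_monte_carlo_estimate`, `sample_input`, and `target_token` are hypothetical stand-ins, and it assumes a transformer that maps a token batch to logits of shape [batch, seq, vocab].

```python
import torch

def naive_monte_carlo_estimate(model, sample_input, target_token, n_samples=10_000):
    """Estimate p = Pr_{x ~ D}[argmax next-token == target_token] by brute force.

    If the true p is far below 1/n_samples, this almost always returns 0,
    which is why rare outputs call for dedicated estimation methods.
    """
    hits = 0
    with torch.no_grad():
        for _ in range(n_samples):
            x = sample_input()            # draw a token sequence from the input distribution D
            logits = model(x)[0, -1, :]   # next-token logits at the final position
            if logits.argmax().item() == target_token:
                hits += 1
    return hits / n_samples
```

By contrast, importance sampling methods such as ITGIS and MHIS draw inputs from a shifted distribution and reweight the results, so they can produce nonzero estimates for events this baseline would essentially never sample.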
---
Outline:
(01:32) Problem statement
(03:24) Estimation methods
(04:57) Importance sampling
(05:57) Independent Token Gradient Importance Sampling (ITGIS)
(07:16) Metropolis-Hastings Importance Sampling (MHIS)
(08:50) Activation extrapolation
(09:21) Quadratic Logit Decomposition (QLD)
(11:06) Gaussian Logit Difference
(11:41) Empirical results
(14:08) Discussion
(14:11) Motivation
(16:15) Importance sampling versus activation extrapolation
(17:15) Limitations
(18:54) Related work
(19:56) Future directions
The original text contained 7 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
