LessWrong (30+ Karma)

“Estimating Tail Risk in Neural Networks” by Jacob_Hilton


Audio note: this article contains 292 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm.
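To see why average-case metrics give weak control over tail events, consider a back-of-the-envelope calculation (a hypothetical illustration, not from the article): a per-query catastrophe probability small enough to be invisible in average-case evaluation can still make a catastrophe near-certain at deployment scale.

```python
# Hypothetical numbers for illustration only (not from the article).
p_fail = 1e-6       # assumed per-query probability of a catastrophic action
n_queries = 10**9   # assumed number of queries over the deployment lifetime

average_accuracy = 1 - p_fail                     # 0.999999: looks nearly perfect
p_any_catastrophe = 1 - (1 - p_fail) ** n_queries # probability of at least one failure

print(f"average accuracy:             {average_accuracy:.6f}")
print(f"P(>=1 catastrophe deployed):  {p_any_catastrophe:.6f}")  # ~1.000000
```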

Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it may be prohibitively expensive to search it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems that we are confident [...]
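As a rough sketch of the adversarial-search approach described above (the toy "model", the hidden bad input, and all names here are hypothetical), random probing of a large input space shows why failing to find a counterexample does not bound the tail probability:

```python
import random

# Toy setup (hypothetical): one catastrophic input hidden in a huge space.
INPUT_SPACE = 10**12
BAD_INPUT = 123_456_789   # unknown to the searcher

def model_is_catastrophic(x: int) -> bool:
    """Stand-in for running the model and applying a catastrophe detector."""
    return x == BAD_INPUT

def adversarial_search(budget: int = 100_000) -> int | None:
    """Probe random inputs looking for catastrophic behavior.

    With a 10^12-point space and a 10^5 budget, the chance of hitting
    the bad input is about 10^-7, so returning None is only very weak
    evidence that the model is safe.
    """
    for _ in range(budget):
        x = random.randrange(INPUT_SPACE)
        if model_is_catastrophic(x):
            return x          # counterexample found
    return None               # no counterexample, but no safety guarantee

print(adversarial_search())   # almost certainly prints None
```

Gradient-guided or learned search can probe far more efficiently than random sampling, but it faces the same coverage problem: the searched region remains a vanishing fraction of the input space.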

---

Outline:

(03:05) A Toy Scenario

(04:32) Deficiencies of Adversarial Training

(06:26) Deliberate Subversion of Adversarial Training

(08:20) A Possible Approach for Estimating Tail Risk

(10:13) Layer-by-layer Activation Modeling

(13:07) Toy Example: Finitely Supported Distributions

(14:24) Method 1: Gaussian Distribution

(16:03) Method 2: Independent Linear Features

(19:01) Relation to Other Work

(19:05) Formalizing the Presumption of Independence

(20:00) Mechanistic Interpretability

(21:02) Relaxed Adversarial Training

(21:24) Eliciting Latent Knowledge

(22:53) Conclusion

(24:25) Appendices

(24:28) Catastrophe Detectors

(28:41) Issues with Layer-by-layer Activation Modeling

(29:17) Layer-by-layer modeling forgets too much information

(31:42) Probability distributions over activations are too restrictive

(35:08) Distributional Shift

(37:48) Complexity Theoretic Barriers to Accurate Estimates

(38:38) 3-SAT

(40:20) Method 1: Assume clauses are independent

(40:57) Method 2: Condition on the number of true variables

(42:51) Method 3: Linear regression

(44:47) Main hope: estimates competitive with the AI system or the training process

The original text contained 13 footnotes which were omitted from this narration.

The original text contained 1 image which was described by AI.

---

First published: September 13th, 2024

Source: https://www.lesswrong.com/posts/xj5nzResmDZDqLuLo/estimating-tail-risk-in-neural-networks

---

Narrated by TYPE III AUDIO.

