

Audio note: this article contains 292 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm.
Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it might be prohibitive to search through it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems that we are confident [...]
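To make this concrete, here is a minimal, hypothetical sketch (not from the article): a toy discrete input space in which catastrophic inputs exist but are so rare that a sampling-based search, the implicit strategy behind the techniques described above, almost surely finds none of them and so reports an estimated tail probability of zero. The names `INPUT_SPACE`, `behaves_catastrophically`, and `search_based_estimate` are illustrative assumptions, not anything defined in the original text.

```python
# Toy illustration (assumed example, not from the original article): why
# search-based estimates of tail-event probability can miss rare
# catastrophic behaviour entirely.
import random

INPUT_SPACE = 10**12          # size of a (discrete) toy input space
CATASTROPHIC = 100            # number of inputs that trigger catastrophe
TRUE_TAIL_PROB = CATASTROPHIC / INPUT_SPACE  # 1e-10

def behaves_catastrophically(x: int) -> bool:
    """Stand-in for running the AI plus a catastrophe detector on input x."""
    return x < CATASTROPHIC   # catastrophic inputs are a vanishing fraction

def search_based_estimate(num_samples: int, seed: int = 0) -> float:
    """Estimate the tail probability by sampling inputs and checking each
    one, as search-based techniques implicitly do."""
    rng = random.Random(seed)
    hits = sum(
        behaves_catastrophically(rng.randrange(INPUT_SPACE))
        for _ in range(num_samples)
    )
    return hits / num_samples

# Even a large search budget almost surely finds no catastrophic input,
# so the estimate is 0 while the true probability is small but nonzero.
print("true tail probability:", TRUE_TAIL_PROB)
print("search-based estimate:", search_based_estimate(num_samples=1_000_000))
```

The article's proposed alternative, outlined below, is to estimate such probabilities from a model of the network's internals (layer-by-layer activation modeling) rather than by searching for catastrophic inputs directly.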
---
Outline:
(03:05) A Toy Scenario
(04:32) Deficiencies of Adversarial Training
(06:26) Deliberate Subversion of Adversarial Training
(08:20) A Possible Approach for Estimating Tail Risk
(10:13) Layer-by-layer Activation Modeling
(13:07) Toy Example: Finitely Supported Distributions
(14:24) Method 1: Gaussian Distribution
(16:03) Method 2: Independent Linear Features
(19:01) Relation to Other Work
(19:05) Formalizing the Presumption of Independence
(20:00) Mechanistic Interpretability
(21:02) Relaxed Adversarial Training
(21:24) Eliciting Latent Knowledge
(22:53) Conclusion
(24:25) Appendices
(24:28) Catastrophe Detectors
(28:41) Issues with Layer-by-layer Activation Modeling
(29:17) Layer-by-layer modeling forgets too much information
(31:42) Probability distributions over activations are too restrictive
(35:08) Distributional Shift
(37:48) Complexity Theoretic Barriers to Accurate Estimates
(38:38) 3-SAT
(40:20) Method 1: Assume clauses are independent
(40:57) Method 2: Condition on the number of true variables
(42:51) Method 3: Linear regression
(44:47) Main hope: estimates competitive with the AI system or the training process
The original text contained 13 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
