Audio note: this article contains 292 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm.
Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it might be prohibitively expensive to search through it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems that we are confident [...]
---
Outline:
(03:05) A Toy Scenario
(04:32) Deficiencies of Adversarial Training
(06:26) Deliberate Subversion of Adversarial Training
(08:20) A Possible Approach for Estimating Tail Risk
(10:13) Layer-by-layer Activation Modeling
(13:07) Toy Example: Finitely Supported Distributions
(14:24) Method 1: Gaussian Distribution
(16:03) Method 2: Independent Linear Features
(19:01) Relation to Other Work
(19:05) Formalizing the Presumption of Independence
(20:00) Mechanistic Interpretability
(21:02) Relaxed Adversarial Training
(21:24) Eliciting Latent Knowledge
(22:53) Conclusion
(24:25) Appendices
(24:28) Catastrophe Detectors
(28:41) Issues with Layer-by-layer Activation Modeling
(29:17) Layer-by-layer modeling forgets too much information
(31:42) Probability distributions over activations are too restrictive
(35:08) Distributional Shift
(37:48) Complexity Theoretic Barriers to Accurate Estimates
(38:38) 3-SAT
(40:20) Method 1: Assume clauses are independent
(40:57) Method 2: Condition on the number of true variables
(42:51) Method 3: Linear regression
(44:47) Main hope: estimates competitive with the AI system or the training process
The original text contained 13 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---