Adapted from this twitter thread. See this as a quick take.
Mitigation Strategies
How to mitigate Scheming?
- Architectural choices: ex-ante mitigation
- Control systems: post-hoc containment
- White box techniques: post-hoc detection
- Black box techniques
- Avoiding sandbagging
We can combine all of those mitigation via defense-in-depth system (like the Swiss Cheese model below)
I think that applying all of those strategies should divide the risk by at least 3.
1. Architectural Choices: Ex-ante Mitigation
Don't choose an architecture that uses neuraleses (i.e. hidden reasoning).
For example, don't train models to think in their latent spaces like Meta!
Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!
See: Why it's good for AI reasoning to be legible and faithful.
Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer [...]
---
Outline:
(00:18) Mitigation Strategies
(01:20) 1. Architectural Choices: Ex-ante Mitigation
(03:26) 2. Control Systems: Post-hoc Containment
(04:00) 3. White Box Techniques: Post-hoc Detection
(05:16) 4. Black Box Techniques
(06:28) 5. Avoiding Sandbagging: Elicitation Technique
(06:52) Current State and Assessment
(07:06) Risk Reduction Estimates
(08:19) Potential Failure Modes
(09:16) The Real Bottleneck: Enforcement
The original text contained 1 footnote which was omitted from this narration.
---