1. Introduction
“Value fragility,” as I’ll construe it, is the claim that slightly-different value systems tend to lead in importantly-different directions when subject to extreme optimization. I think the idea of value fragility haunts the AI risk discourse in various ways – and in particular, that it informs a backdrop prior that adequately aligning a superintelligence requires an extremely precise and sophisticated kind of technical and ethical achievement. That is, the thought goes: if you get a superintelligence’s values even slightly wrong, you’re screwed.
This post is a collection of loose and not-super-organized reflections on value fragility and its role in arguments for pessimism about AI risk. I start by trying to tease apart a number of different claims in the vicinity of value fragility. In particular:
- I distinguish between questions about value fragility and questions about how different agents would converge on the same values given adequate [...]
---
Outline:
(00:04) 1. Introduction
(03:46) 2. Variants of value fragility
(03:57) 2.1 Some initial definitions
(09:02) 2.2 Are these claims true?
(11:23) 2.3 Value fragility in the real world
(11:59) 2.3.1 Will agents optimize for their values on reflection, and does this matter?
(14:59) 2.3.2 Will agents optimize extremely/intensely, and does this matter?
(24:05) 2.4 Multipolar value fragility
(28:20) 2.4.1 Does multipolarity diffuse value fragility somehow?
(32:10) 3. What's the role of value fragility in the case for AI risk?
(35:43) 3.1 The value of what an AI does after taking over the world
(37:15) 3.2 Value fragility in the context of extremely-easy takeovers
(45:43) 3.3 Value fragility in cases where takeover isn’t extremely easy
(52:36) 4. The possible role of niceness and power-sharing in diffusing these dynamics
The original text contained 16 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---