Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Instrumental vs Terminal Desiderata, published by Max Harms on June 26, 2024 on The AI Alignment Forum.
Bob: "I want my AGI to make everyone extremely wealthy! I'm going to train that to be its goal."
Cassie: "Stop! You'll doom us all! While wealth is good, it's not everything that's good, and so even if you somehow build a wealth-maximizer (instead of summoning some random shattering of your goal), it will sacrifice all the rest of the good in the name of wealth!"
Bob: "Maybe if it suddenly became a god-like superintelligence, but I'm a hard take-off skeptic. In the real world we have continuous processes and I'm going to be in control. If it starts to go off the rails, I'll just stop it and re-train it to not do that."
Cassie: "Be careful what you summon! While it may seem like you're in control in the beginning, these systems are generalized obstacle-bypassers, and you're making yourself into an obstacle that needs to be bypassed. Whether that takes two days or twenty years, you're setting us up to die."
Bob: "Ok, fine. So I'll build my AGI to make people rich and simultaneously to respect human values and property rights and stuff. At the point where it can bypass me, it'll avoid turning everyone into bitcoin mining rigs or whatever because that would go against its goal of respecting human values."
Cassie: "What does 'human values' even mean? I agree that if you can build an AGI that is truly aligned, we're good, but that's a tall order and it doesn't even seem like what you're aiming for. Instead, it seems like you think we should train the AGI to maximize a pile of desiderata."
Bob: "Yeah! My AGI will be helpful, obedient, corrigible, honest, kind, and will never produce copyrighted songs, memorize the NYT, or impersonate Scarlett Johansson! I'll add more desiderata to the list as I think of them."
Cassie: "And what happens when those desiderata come into conflict? How does it decide what to do?"
Bob: "Hrm. I suppose I'll define a hierarchy like Asimov's laws. Some of my desiderata, like corrigibility, will be constraints, while others, like making people rich, will be values. When a constraint comes in conflict with a value, the constraint wins. That way my agent will always shut down when asked, even though doing so would be a bad way to make us rich."
Cassie: "Shutting down when asked isn't the hard part of corrigibility, but that's a tangent. Suppose the AGI faces a choice between a 0.0001% chance of being dishonest but earning a billion dollars, and a 0.00001% chance of being dishonest but earning nothing. What will it do?"
Bob: "Hrm. I see what you're saying. If my desiderata are truly arranged in a hierarchy with certain constraints on top, then my agent will only ever pursue its values if everything upstream is exactly equal, which won't be true in most contexts. Instead, it'll essentially optimize solely for the topmost constraint."
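Bob's realization can be made concrete with a small sketch. This is hypothetical illustration code, not anything from the dialogue: it models his hierarchy as a lexicographic comparison, where options are scored on a (constraint, value) tuple and Python's tuple ordering means any difference in the constraint term settles the choice before values are ever consulted. The option names and scores mirror Cassie's billion-dollar example.

```python
# Hypothetical sketch of Bob's hierarchy: each option is scored on a
# (constraint_score, value_score) tuple. Tuple comparison in Python is
# lexicographic, so any difference in the first element decides the
# choice before the second element is even looked at.

def lexicographic_choice(options):
    """Pick the option with the best (honesty, wealth) tuple."""
    return max(options, key=lambda o: (o["honesty"], o["wealth"]))

# Cassie's example: A risks a 0.0001% chance of dishonesty but earns
# a billion dollars; B risks only 0.00001% and earns nothing.
a = {"name": "A", "honesty": 1 - 0.000001, "wealth": 1_000_000_000}
b = {"name": "B", "honesty": 1 - 0.0000001, "wealth": 0}

best = lexicographic_choice([a, b])
print(best["name"])  # B -- the billion dollars never enters the comparison
```

Because the constraint scores differ (however slightly), the agent forgoes the billion dollars entirely: exactly Bob's point that the agent ends up optimizing solely for the topmost constraint.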
Cassie: "I predict that it'll actually learn to want a blend of things, and find some weighting such that your so-called 'constraints' are actually just numerical values along with the other things in the blend. In practice you'll probably get a weird shattering, but even if you're magically lucky and get what you aim for, you'll still probably just get a weighted mixture. Getting a truly hierarchical goal seems nearly impossible outside of toy problems."
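Cassie's prediction can be sketched the same way. In this hypothetical illustration (the weights are invented for the example, not learned), the "constraint" is just a large-but-finite coefficient in a single weighted sum, so a big enough value payoff can outweigh a small constraint violation and the hierarchy collapses.

```python
# Hypothetical sketch of Cassie's prediction: the trained agent acts on
# one weighted utility, where the former "constraint" is just a large
# coefficient. Unlike a lexicographic agent, this one will trade a tiny
# honesty cost for a large enough wealth payoff.

def blended_utility(option, weights):
    """Single scalar utility: a weighted sum over all desiderata."""
    return sum(weights[k] * option[k] for k in weights)

weights = {"honesty": 1000.0, "wealth": 1.0}  # illustrative, not learned

a = {"honesty": -0.000001, "wealth": 1.0}  # tiny dishonesty risk, big payoff
b = {"honesty": 0.0, "wealth": 0.0}        # perfectly honest, earns nothing

print(blended_utility(a, weights) > blended_utility(b, weights))  # True
```

Here the weighted agent takes the deal the lexicographic agent refused: the honesty penalty of 1000 × 0.000001 is dwarfed by the wealth term, which is the collapse Cassie is describing.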
Bob: "Doesn't this mean we're also doomed if we train an AGI to be truly aligned? Like, won't it still sometimes sacrifice one aspect of alignment, like being honest, in order to get a sufficiently large quantity of another aspect of alignment, like saving lives?"
Cassie: "That seems confused. My point is that a coherent agent will act as though it's maximizing a utility function, and that if your strategy involves lumping together a bunch of desiderata as good-in-...