Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Review of AI Alignment Progress, published by PeterMcCluskey on February 7, 2023 on LessWrong.
I'm having trouble keeping track of everything I've learned about AI and AI alignment in the past year or so. I'm writing this post in part to organize my thoughts, and to a lesser extent I'm hoping for feedback about what important new developments I've been neglecting. I'm sure that I haven't noticed every development that I would consider important.
I've become a bit more optimistic about AI alignment in the past year or so.
I currently estimate a 7% chance AI will kill us all this century. That's down from estimates that fluctuated from something like 10% to 40% over the past decade. (The extent to which those numbers fluctuate implies enough confusion that it only takes a little bit of evidence to move my estimate a lot.)
I'm also becoming more nervous about how close we are to human-level and transformative AGI. I also feel uncomfortable that I still don't have a clear understanding of what I mean when I say human-level or transformative AGI.
Shard Theory
Shard theory is a paradigm that seems destined to replace the focus (at least on LessWrong) on utility functions as a way of describing what intelligent entities want.
I kept having trouble with the plan to get AIs to have utility functions that promote human values.
Human values mostly vary in response to changes in the environment. I can make a theoretical distinction between contingent human values and the kind of fixed terminal values that seem to belong in a utility function. But I kept getting confused when I tried to fit my values, or typical human values, into that framework. Some values seem clearly instrumental and contingent. Some values seem fixed enough to sort of resemble terminal values. But whenever I try to convince myself that I've found a terminal value that I want to be immutable, I end up feeling confused.
Shard theory tells me that humans don't have values that are well described by the concept of a utility function. Probably nothing will go wrong if I stop hoping to find those terminal values.
We can describe human values as context-sensitive heuristics. That will likely also be true of AIs that we want to create.
I feel deconfused when I reject utility functions, in favor of values being embedded in heuristics and/or subagents.
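To make that contrast concrete, here is a toy sketch of my own (not taken from any of the shard theory posts, and with made-up names and weights): a utility-function agent applies one fixed scoring rule in every situation, while a shard-style agent is a bundle of heuristics that only fire in the contexts that reinforced them.

# Toy contrast: a fixed utility function vs. shard-style, context-sensitive
# heuristics. All names and numbers are invented purely for illustration.

def utility_maximizer_score(outcome):
    # One immutable scoring rule, applied identically in every context.
    weights = {"food": 1.0, "social": 1.5, "safety": 2.0}
    return sum(weights.get(k, 0.0) * v for k, v in outcome.items())

def shard_agent_score(outcome, context):
    # Each shard is a heuristic that only activates in certain contexts;
    # there is no single global objective being maximized.
    score = 0.0
    if context.get("hungry"):
        score += 3.0 * outcome.get("food", 0)    # food-seeking shard
    if context.get("with_friends"):
        score += 2.0 * outcome.get("social", 0)  # social shard
    if context.get("in_danger"):
        score += 5.0 * outcome.get("safety", 0)  # self-preservation shard
    return score

outcome = {"food": 1, "social": 1, "safety": 0}
print(utility_maximizer_score(outcome))                    # 2.5 in any context
print(shard_agent_score(outcome, {"hungry": True}))        # 3.0
print(shard_agent_score(outcome, {"with_friends": True}))  # 2.0

The point of the sketch is only that the second agent's "values" are not summarized by any single function over outcomes; which heuristic matters depends on the situation.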
Some of the posts that better explain these ideas:
Shard Theory in Nine Theses: a Distillation and Critical Appraisal
The shard theory of human values
A shot at the diamond-alignment problem
Alignment allows "nonrobust" decision-influences and doesn't require robust grading
Why Subagents?
Section 6 of Drexler's CAIS paper
EA is about maximization, and maximization is perilous (i.e. it's risky to treat EA principles as a utility function)
Do What I Mean
I've become a bit more optimistic that we'll find a way to tell AIs things like "do what humans want", have them understand that, and have them obey.
GPT-3 has a good deal of knowledge about human values, scattered around in ways that limit the usefulness of that knowledge.
LLMs show signs of being less alien than theory, or evidence from systems such as AlphaGo, led me to expect. Their training causes them to learn human concepts pretty faithfully.
That suggests clear progress toward AIs understanding human requests. That seems to be proceeding a good deal faster than any trend toward AIs becoming agenty.
However, LLMs suggest that it will not be at all trivial to ensure that AIs obey some set of commands that we've articulated. Much of the work done by LLMs involves simulating a stereotypical human. That puts some limits on how far they stray from what we want. But the LLM doesn't have a slot where someone could just drop in Asimov's Laws so as to cause the LLM to adopt those laws as its goals.
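As a rough illustration of that point (my own sketch, assuming a generic autoregressive interface rather than any particular model's real API): a prompt containing Asimov's Laws is just more conditioning text for next-token prediction, not a separate objective the model optimizes.

# Sketch of a generic autoregressive sampling loop. The model and
# tokenizer objects here are hypothetical stand-ins, not a real library.

ASIMOV_LAWS = (
    "1. A robot may not injure a human being.\n"
    "2. A robot must obey orders given to it by human beings.\n"
    "3. A robot must protect its own existence.\n"
)

def generate(model, tokenizer, prompt, max_new_tokens=50):
    # The model's behavior comes entirely from its learned next-token
    # probabilities; prepending rules changes the conditioning text,
    # not the objective the model was trained on.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model.sample_next(tokens))
    return tokenizer.decode(tokens)

# "Installing" the laws is just string concatenation before encoding;
# there is no slot that makes them binding goals:
# generate(model, tokenizer, ASIMOV_LAWS + user_request)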
The post Retarge...