Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Disentangling Shard Theory into Atomic Claims, published by Leon Lang on January 13, 2023 on The AI Alignment Forum.
Introduction
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Magdalena Wache for giving feedback on a recent version, and to Alex Turner for giving feedback on an early version of this article.
When thinking about shard theory, I noticed that my brain wanted to answer questions such as "What does shard theory predict?" or "Do I agree with shard theory?". Additionally, I observed other people engage in similar thinking. I now think this is confused, and that it is better to view shard theory as a bag of related claims that should be reasoned about and evaluated separately.
Therefore, in this post I want to answer the following questions:
What do I currently consider the main claims of shard theory?
What is my own stance on these claims?
This distillation is very similar in spirit to LawrenceC’s Shard Theory in Nine Theses. Note, however, that I deliberately did not read that distillation, so that my explanation of shard theory would be more independent.
My one-sentence categorization is that shard theory is both a theory of human value formation and a paradigm/frame for thinking about alignment. It might also become a theory of value formation in RL agents, but it’s not quite there yet, since it doesn’t make enough concrete and formalized empirical predictions.
Instead of splitting this distillation into “claims about humans” and “claims about trained RL agents”, I decided to make a different split. Namely, I will start with a section on claims in the shard theory meme-space that do not involve shards at all, and then move on to claims that actually make use of the concept of a shard. I hope that this separation makes it easier for people to agree or disagree with very specific claims instead of accepting or rejecting the theory as a whole.
Please let me know if I forgot any other important claims, and share any other relevant feedback on this post.
After each claim, I will indicate my level of agreement as follows:
✓ : Agree
(✓) : Tentatively agree
? : Neither agree nor disagree
Shard Theory claims without any shards
The following are claims from shard theory that can mostly be formulated without even talking about shards. Shards are often in the background of much of the thinking, but they don't need to be mentioned explicitly.
Humans get their values from within-lifetime learning
The Claim
There is neuroscientific evidence showing that humans get most of their complex values from within-lifetime learning. In other words, human values are "learned from scratch" and not "hardcoded". One core argument is that we value many things that didn’t even exist in our evolutionary environment, like photographs of our past or specific regional traditions. Other arguments are made in Human values & biases are inaccessible to the genome.
There are hardcoded reward circuits in human brains, mostly coming from the brain stem, that provide reinforcement signals that the brain uses to develop its values, but the resulting values do not coincide with this reward.
The claim is used as supporting evidence in two posts: Humans provide an untapped wealth of evidence about alignment, and The shard theory of human values.
My Opinion: (✓)
I tentatively agree with this claim after reading much of the above-mentioned supporting evidence. My agreement is only tentative because I'm not even remotely a neuroscientist. More concretely, while it seems reasonable that our actual values emerge within our lifetimes, it seems conceivable to me that the evolutionarily formed inductive biases of our brains contain many "hacks" that are hard to reproduce in machine learning systems.
To achieve alignment of ML systems, we should learn more about how humans get ...