Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Computational Anatomy of Human Values, published by Beren Millidge on April 6, 2023 on The AI Alignment Forum.
This is crossposted from my personal blog.
Epistemic Status: Much of this draws from my studies in neuroscience and ML. Many of the ideas in this post are heavily inspired by the work of Steven Byrnes and the authors of Shard Theory. It speculates well beyond the scientific frontier and is almost certainly incorrect in many respects. Nevertheless, I believe the core point is true and important.
Tldr: Human values are primarily linguistic concepts encoded via webs of association and valence in the cortex, learnt through unsupervised (primarily linguistic) learning. These value concepts are bound to behaviour through (a) a combination of low-level RL and associations with low-level reward signals, integrated into the amortized policy, and (b) linguistically based associations and behavioural cloning of socially endorsed behaviours or the behaviours of others. This is mediated by our ‘system 2’, which operates primarily at a linguistic level and consists of iterative self-conditioning through the world model. The important representation space for human values is the latent space of the linguistic world model and the web of associations therein, together with the connections between it and the low-level policies and reward models of the RL subsystems.
The geometry of the embeddings in the latent space is primarily influenced by the training data – i.e. culture and behavioural history, although the association of different latent concepts with positive and negative valence can be driven by the RL system which interfaces with primary rewards. The geometry of the latent space can also be rewritten with continual learning on self-prompts or external stimuli.
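As a rough illustration of the picture above, here is a toy sketch in Python. It is not the author's model: the concept names, the 2-D embeddings, the linear valence readout, and the delta-rule update are all assumptions chosen for illustration. The embeddings are held fixed (standing in for geometry set by training data), and only the valence attached to them is learnt from a reward signal.

```python
# Hypothetical 2-D concept embeddings, standing in for a latent space learnt
# by unsupervised learning. The numbers are made up for illustration.
EMBEDDINGS = {
    "kindness": (0.9, 0.1),
    "generosity": (0.8, 0.2),
    "cruelty": (-0.9, 0.1),
}


def valence(weights, concept):
    """Linear valence readout: dot product of weights and the concept embedding."""
    return sum(w * x for w, x in zip(weights, EMBEDDINGS[concept]))


def update_valence(weights, concept, reward, lr=0.5):
    """Delta-rule update that moves the predicted valence towards the reward."""
    error = reward - valence(weights, concept)
    return tuple(w + lr * error * x for w, x in zip(weights, EMBEDDINGS[concept]))


def train(experiences, steps=20):
    """Repeatedly apply the delta rule over (concept, reward) pairs."""
    weights = (0.0, 0.0)
    for _ in range(steps):
        for concept, reward in experiences:
            weights = update_valence(weights, concept, reward)
    return weights


# Only "kindness" is ever paired with reward; the embeddings never change.
weights = train([("kindness", 1.0)])
for name in EMBEDDINGS:
    print(name, round(valence(weights, name), 2))
```

In this sketch only "kindness" is ever rewarded, yet "generosity" ends up with positive valence and "cruelty" with negative valence, purely because of where they sit relative to "kindness" in the assumed latent space. That is the sense in which geometry and valence are separate but interacting ingredients.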
In AI alignment, the goal is often understood to be aligning an AGI to human values. Then, typically, the flow of logic shifts to understanding alignment: how to align an AGI to any goal at all. The second element of the sentence – human values – is much less discussed and explored. This is probably partially because alignment sounds like a serious and respectable computer science problem while exploring human values sounds like a wishy-washy philosophy/humanities problem which we assume is either trivially solvable, or else outside the scope of technical problem solving. A related view, which draws implicitly from the orthogonality thesis, but is not implied by it, is that the alignment problem and the human values problem are totally separable: we can first figure out alignment to anything and then after that figure out human values as the alignment target.
On this view, there is no point in understanding human values until we can align an AGI to anything, so the correct order is to first figure out alignment and only afterwards try to understand human values.
I think this view is wrong and that the alignment mechanism and the alignment target do not always cleanly decouple. This means we can leverage information about the alignment target to develop better or easier alignment methods. If this is the case, we might benefit from better understanding what human values actually are, so we can use information about them to design alignment strategies. However, naively, this is hard. ‘Human values’ appears to be an almost intentionally nebulous and unspecific term. What are human values? What is their type signature (and is this even a meaningful question)? How do they come about?
Here, we try to attack this problem through the lens of neuroscience and machine learning. Specifically, we want to understand the computational anatomy of human values. Namely, what kind of things are they, computationally? How do they form? How does the functional architecture of the brain enable such constructs to exist, and how does it utilize them to ...