Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Quantitative cruxes in Alignment, published by Martín Soto on July 2, 2023 on The AI Alignment Forum.
Summary: Sometimes researchers talk past each other because their intuitions disagree on underlying quantitative variables. Getting better estimates of these variables is action-guiding. This is an epistemological study to bring them to the surface, as well as to discover which sources of evidence can inform our predictions and which methodologies researchers can and do use to deal with them.
This post covers the discussion of quantitative cruxes. The second one covers sources of evidence (a topic interesting in its own right). The whole write-up can also be read in this document; it will be expanded with a benchmark of thought experiments in alignment.
Work done during my last two weeks of SERI MATS 3.1. Thanks to Vivek Hebbar, Filip Sondej, Lawrence Chan and Joe Benton for related discussions.
Many qualitative arguments urge worries about AI. Most of these argue, briefly and accessibly, for the possibility (rather than the necessity) of AI catastrophe. Indeed, that's how I usually present the case to my acquaintances: most people aren't too confident about how the future will go, but if some as-yet-unobserved variables turn out not to line up well enough, we'll be screwed. A qualitative argument for a non-negligible chance of catastrophe is enough to motivate action.
But mere possibilities aren't enough to efficiently steer the future toward safety: we need probabilities. And these are informed by quantitative estimates related to training dynamics, generalization capacity, biases of training data, efficiency of learned circuits, and other properties of parameter space. For example, a higher credence that training runs on task A with more than X parameters have a high chance of catastrophe would make us more willing to risk shooting for a pivotal act. A lower credence, by contrast, would advise continuing to scale up current efforts. Marginal additional clarity on these predictions is action-relevant.
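To make the action-relevance concrete, here is a minimal decision-theoretic sketch (my illustration, with hypothetical placeholder quantities, not something from the original post). Write p for our credence that a training run at a given scale ends in catastrophe, U_scale for the value of scaling up if it goes well, U_cat for the value if it does not, and U_pause for the value of the more cautious alternative. Under a simple expected-value rule, and assuming U_scale > U_cat, we should scale up exactly when (1 - p) * U_scale + p * U_cat > U_pause, which rearranges to p < (U_scale - U_pause) / (U_scale - U_cat). When our estimate of p sits near that threshold, even a modest sharpening of the estimate flips the recommended action.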
Indeed, the community has developed strong qualitative arguments for why dangerous dynamics might appear in sufficiently big or capable systems (in the limit of arbitrary intelligence). But it's way harder to quantitatively assess when, and in what more concrete shape, they could actually arise, and this dimension is sometimes ignored, leading to communication failures. This is especially notable when researchers talk past each other because of different quantitative intuitions, without making explicit their sources of evidence or the methodology by which they arrive at their conclusions (although luckily some lengthier conversations have already been trying to partially rectify this).
Addressing these failures explicitly is especially valuable when doing pre-paradigmatic science that is action-relevant in the near future, and can help build common knowledge and inform strategies (for example, in compute governance). That's why I'm excited to bring these cruxes to the surface and study how we do and should address them. It's useful to keep a record of where our map is less accurate.
Methodology
We start with some general and vague action-relevant questions, for example:
A. How easy is it to learn a deceptive agent in a training run?
B. How safe is it to use AIs for alignment research?
C. How much alignment failure is necessary to destabilize society?
(Of course, these questions are not independent.)
We then try to concretize these vague questions into more quantitative cruxes that would help answer them. For example, a concrete quantity we can ask about A is
A1. How small is the minimal description of a strong consequentialist?
(Of course, this question is not yet concrete enough: it lacks disambiguations of "description" and "strong consequentialist", as well as more context-setting.)
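One hedged way to see why a quantity like A1 bears on question A (this framing is mine, not the post's): suppose training is biased toward simple functions, with a hypothesis of description length L receiving prior weight roughly proportional to 2^(-L). Then if the minimal description length of a strong consequentialist that fits the training signal is L_dec, and that of an aligned policy fitting it equally well is L_align, the prior odds of landing on the former rather than the latter scale roughly as 2^(-(L_dec - L_align)). The smaller L_dec is, the easier it is for training to stumble into a deceptive agent, which is exactly the kind of quantitative input question A needs.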
The ultimate gold standard for such concrete quant...