Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sources of evidence in Alignment, published by Martín Soto on July 2, 2023 on The AI Alignment Forum.
Summary: A short epistemological study to discover which sources of evidence can inform our predictions of action-relevant quantities in alignment.
This post follows Quantitative cruxes, although reading that first is mostly not required. Work done during my last two weeks of SERI MATS 3.1.
Sources of evidence
No researcher in any field ever makes explicit all of their sources of evidence. Let alone in a field as chaotic and uncertain as ML, in which hard-earned experience and intuitions play a central role in stirring the tensor pile. And even less in a field with as many varied opinions and confusing questions as alignment. Nonetheless, even when researchers are just “grokking some deeper hard-to-transmit structure from familiar theory and evidence”, they need to get their bits of information from somewhere. Knowledge doesn’t come for free: it requires entanglement with observed parts of reality.
Getting a better picture of where we are looking and where we could be looking, brainstorming new sources or deepening existing ones, and understanding the limitations of our methodology (as Adam Shimi’s efforts already pursue) can dissolve confusions, speed up progress, help us calibrate, and build common knowledge.
In reality, the following sources of evidence motivating any belief are way less separable than the text below might make it seem. Nonetheless, isolating them yields more conceptual clarity and is the first step for analysis.
1. Qualitative arguments
One obvious, theoretical source, and the most used in this community by far. The central shortcoming is that their abstractions are flexibly explanatory exactly because they abstract away detail. They thus provide more information about the existence of algorithms or dynamics than about related quantities that matter, like how prevalent they actually are in a certain space, when these dynamics actually start to appear and with how much steering power, etc.
Sometimes a tacit assumption might seem to be made: there are so many qualitative arguments for the appearance of these dynamics (and so few for the appearance of rectifying dynamics), that surely one of them will be present to a relevant degree, and early enough! This seems like a sort of presumption of independence about yet unobserved structure: a priori, we have no reason to believe any one of these qualitative arguments has higher or lower quantitative impact, so we should settle on the vague prior of them all having similar effects (and so, the side with more qualitative arguments wins). While this is truly the best we can do when further evidence isn’t available, it seems like an especially fragile prior, ignoring the many possible interdependencies among some qualitative arguments (how their validity clusters across different worlds), and possible correlated failures / decoupling of abstractions from reality, or systemic biases in the search for qualitative arguments.
Incorporating some of these considerations is already enough to both slightly better inform our estimates, and especially better calibrate our credences and uncertainty.
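As a toy illustration of how fragile the presumption of independence can be, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and not something taken from the post: the function names, the numbers (10 qualitative arguments, each with a 20% chance of mattering), and especially the simple common-cause mixture used to model interdependence (with weight rho all arguments stand or fall together, otherwise they are independent).

```python
def p_at_least_one_independent(p: float, n: int) -> float:
    """P(at least one of n arguments holds) if each holds independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def p_at_least_one_mixture(p: float, n: int, rho: float) -> float:
    """Same quantity under a hypothetical common-cause mixture.

    With probability rho the arguments share a single fate (all hold with
    probability p, else all fail); with probability 1 - rho they are
    independent. Each argument's marginal probability stays p either way.
    """
    return rho * p + (1.0 - rho) * p_at_least_one_independent(p, n)


if __name__ == "__main__":
    p, n = 0.2, 10  # illustrative numbers only
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho:.1f}  P(at least one holds)={p_at_least_one_mixture(p, n, rho):.3f}")
```

Under these assumed numbers, full independence gives roughly 0.89 for "at least one argument holds", while full correlation gives just 0.2, the probability of a single argument. The count of arguments alone says little until we know how their validity clusters across worlds.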
Of course, qualitative arguments are usually informed by and supplemented with other sources of evidence that can provide a quantitative estimate, and then the presumption of independence is applied after incorporating these different sources (which is usually a considerably better situation to be in, unequivocally more grounded in reality). We can even explicitly reason qualitatively about the different relative magnitudes of some dynamics, as in How likely is deceptive alignment?
And sometimes, in even less explicit ways, intuitive assessments of the strength of quantitative effects (or even the fundamental shape of the qualitative arguments) ar...