Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on "Process-Based Supervision", published by Steve Byrnes on July 17, 2023 on The AI Alignment Forum.
1. Post Summary / Table of Contents
In "How might we align transformative AI if it's developed very soon?", Holden Karnofsky talked about "Process-Based Supervision", citing a previous post by Stuhlmüllert & Byun of Ought. (Holden says he got the idea mainly from Paul Christiano.)
I apparently misunderstood what Holden meant by "Process-Based Supervision", and it took many hours and a 7000-word comment thread before I figured it out.
(Thanks to Holden for his extraordinary patience during that protracted discussion.)
The extremely short version for AI alignment domain experts is: I currently think of the load-bearing ingredients of Holden's take on "process-based supervision" as being:
AI boxing some of the time (specifically, during the periods where the AI is "thinking");
"Myopic training" (e.g. as defined here);
NOT aspiring to be a complete solution to safety, but rather a more modest attempt to help avoid situations where we (accidentally) positively reinforce the AI for engaging in directly dangerous behavior.
You could think of this as an intervention that directly and straightforwardly mitigates outer misalignment, and that's the main thing I'll discuss in this post. But obviously, any change of the supervisory signals will have some effect on the likelihood of inner misalignment / goal misgeneralization too. And there's also a (more speculative) argument that process-based supervision might make things better there too - at least on the margin. See footnote 4 of Section 5.2.2.
(This is specific to Holden's take. I think Stuhlmüller & Byun's take on "process-based supervision" involves a different set of load-bearing ingredients, centered around restricting the complexity of black-box processing. I will not be discussing that.)
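As a very toy illustration of the distinction above, here is a short Python sketch of my own (not anything from Holden's post or the Ought post). It only shows the structural difference in how credit gets assigned under outcome-based versus myopic, per-step supervision; every name in it is a made-up stand-in.

```python
# Toy illustration (not from the post): the structural difference between
# outcome-based credit assignment and myopic, per-step ("process-based")
# supervision. All names below are hypothetical stand-ins.

def outcome_based_rewards(steps, final_outcome_score):
    """Every step in the trajectory inherits credit from the final outcome.
    If an unnoticed manipulative step helped produce a good-looking outcome,
    that step gets positively reinforced along with the rest."""
    return [final_outcome_score for _ in steps]

def process_based_rewards(steps, rate_step):
    """Each step is rated on its own ("does this look like a reasonable next
    step?"), without waiting to see its downstream effects on the world, so a
    step can't pick up extra credit via consequences the overseer never sees."""
    return [rate_step(step) for step in steps]

# Example usage with a stand-in step-rater:
steps = ["brainstorm options", "draft a plan", "send a manipulative email"]
rate_step = lambda s: 0.0 if "manipulative" in s else 1.0
print(outcome_based_rewards(steps, final_outcome_score=1.0))  # [1.0, 1.0, 1.0]
print(process_based_rewards(steps, rate_step))                # [1.0, 1.0, 0.0]
```

Nothing in this sketch bears on whether the per-step rater is actually competent to catch subtle manipulation; it only shows where the reinforcement comes from.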
The long, hopefully-pedagogical, and more opinionated version is the rest of this post.
Table of Contents:
Section 2 will give the very brief slogan / sales-pitch for process-based supervision, and why that pitch was bouncing off me, striking me as frustratingly missing-the-point.
Section 3 will state the subproblem that we're trying to solve: the AI does subtly manipulative, power-seeking, or otherwise problematic actions, and we don't notice, and therefore we give a training signal that reinforces that behavior, and therefore the AI does those things more and more. To be clear, this is not the only path to dangerous misalignment (in particular, classic "treacherous turns" are out-of-scope). But maybe solving just this subproblem can be part of a complete solution. I'll get back to that in Section 5.
Section 4 describes "process-based supervision" as I currently understand it, and why it seems to solve the subproblem in question.
Finally, having described process-based supervision as I currently understand it, Section 5 offers a critical evaluation of that idea. In particular:
5.1 asks "Does this actually solve the subproblem in question?";
5.2 asks "What about the other misalignment-related subproblems?";
5.3 asks "How bad is the "alignment tax" from doing this kind of thing?";
and 5.4 is a summary.
Tl;dr: Once we get to the capabilities regime where AI safety / alignment really matters, I currently think that process-based supervision would entail paying a very big alignment tax - actually, not just "big" but potentially infinite, as in "this kind of AGI just plain can't do anything of significance". And I also currently think that, of the somewhat-vague paths I see towards AGI technical safety, process-based supervision wouldn't make those paths noticeably easier or more likely to succeed. (Of those two complaints, I feel more strongly about the first one.) This take is pretty specific to my models of what AGI algorithms ...