Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Security Mindset, S-Risk and Publishing Prosaic Alignment Research, published by marc/er on April 22, 2023 on LessWrong.
Note: Don’t take this as a cry to stop publishing alignment research; take it as a cry to approach doing so with a security mindset.
Introduction
When converging upon useful alignment ideas, it is natural to want to share them, get feedback, run experiments, and iterate in the hope of improving them, but doing so hastily can run contrary to the security mindset. The Waluigi Effect, for example, ended with a rather grim conclusion: “If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe. Moreover, this Semiotic–Simulation Theory has increased my credence in the absurd science-fiction tropes that the AI Alignment community has tended to reject, and thereby increased my credence in s-risks.” Could the same apply to other prosaic alignment techniques? What if they do end up scaling to superintelligence? I am realizing more and more that, just like the capabilities researchers I blame for lacking the insight or drive to deviate from publishing, I am also falling victim to Moloch.
It's easy to internally justify publishing in spite of this. Some justifications I have given myself include “You’re so new to this, this is not going to have any real impact anyway”, or being so single-mindedly focused on mitigating extinction that, despite recognizing the spectrum of possible outcomes, my internal world model is truly just binary (extinct or otherwise). These justifications always sound reasonable at the time, but upon reflection it is easy to see that they are completely at odds with my true mindset when developing exciting new ideas or thinking about explaining them to other people. While developing new ideas I would often tell myself “This is really interesting! I think this could have an impact on alignment!”, and while justifying publishing them I would be telling myself “It’s probably not that good of an idea anyway, just push through and wait until someone points out the obvious flaw in your plan.”
Completely and detrimentally inconsistent!
Arguments 1 and 3
Of course many people have shared their musings about this before. Andrew Saur does so in his post “The Case Against AI Alignment”, Andrea Miotti explains here why they believe RLHF-esque research is a net-negative for existential safety (see Paul Christiano’s response), and Christiano provides an alternative perspective in his “Thoughts on the Impact of RLHF Research”. Generally, these arguments seem to fall into three different categories (or their inversions):
1. This research will further capabilities more than it will alignment.
2. This alignment technique, if implemented, could actually elevate s-risks or x-risks.
3. This alignment technique, if implemented, will increase the commercializability of AI, feeding into the capabilities hype cycle and thereby indirectly contributing to capabilities more than to alignment.
To clarify, I am speaking here in terms of impact, not in terms of some static progress metric (e.g. one year of alignment progress is not necessarily equivalent in impact to one year of capabilities progress).
In an ideal world, we could trivially measure potential capabilities and alignment advances and make simple comparisons. Sadly, this is not that world, and realistically most decisions are going to be intuition-derived. Worse yet, argument 2 is even murkier than 1 and 3. Trying to pose responses to argument 2 is quite literally trying to prove that a solution whose implications you don’t know won’t result in an outcome we can hardly begin to imagine, through means we don’t understand. On the bright side, we a...