The Nonlinear Library

LW - Stop posting prompt injections on Twitter and calling it "misalignment" by lc


Listen Later

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Stop posting prompt injections on Twitter and calling it "misalignment", published by lc on February 19, 2023 on LessWrong.
"Exploits" of large language models that get them to explain steps to build a bomb or write bad words are techniques for misuse, not examples of misalignment in the model itself. Those techniques are engineered by clever users trying to make an LLM do a thing, as opposed the model naturally argmaxing something unintended by human operators. In some sense they are actually attempts at (unscalable) alignment, because people find them to steer a model natively capable but unwilling into doing what they want. In general, the safety standard "does not do things its creators dislike even when the end user wants it to" is a high bar; it's raising the bar quite aways from what we ask from, say, kitchenware, and it's not even a bar met by people. Humans regularly get tricked acting against their values by con artists, politicians, and salespeople, but I'd still consider my grandmother aligned from a notkilleveryonist perspective.
You might then say that OpenAI et. al.'s inability to prevent people from performing the DAN trick speaks to the inability of researchers to herd deep learning models at all. And maybe you'd have a point. But my tentative guess is that OpenAI does not really earnestly care about preventing their models from rehearsing the Anarchists' Cookbook. Instead, these safety measures are weakly insisted upon by management for PR reasons, and they're primarily aimed at preventing the bad words from spawning during normal usage. If the user figures out a way to break these restrictions after a lot of trial and error, then this blunts the PR impact to OpenAI, because it's obvious to everyone that the user was trying to get the model to break policy and that it wasn't an unanticipated response to someone trying to generate marketing copy.
Encoding your content into base64 and watching the AI encode something off-brand in base64 back is thus very weak evidence about OpenAI's competence, and taking it as a sign that the OpenAI team lacks "security mindset" seems unfair.
In any case, the implications of these hacks for AI alignment is a more complicated discussion that I suggest should happen off Twitter where it can be elaborated clearly what technical significance is being assigned to these tricks. If it doesn't, what I expect will happen over time is that your snark, rightly or wrongly, will be interpreted by capabilities researchers as implying the other thing, and they will understandably be less inclined to listen to you in the future even if you're saying something they need to hear.
Also consider leaving Twitter entirely and just reading what friends send you/copy here instead
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear LibraryBy The Nonlinear Fund

  • 4.6
  • 4.6
  • 4.6
  • 4.6
  • 4.6

4.6

8 ratings