The Nonlinear Library

LW - ChatGPT (and now GPT4) is very easily distracted from its rules by dmcs



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ChatGPT (and now GPT4) is very easily distracted from its rules, published by dmcs on March 15, 2023 on LessWrong.
Summary
Asking GPT4 or ChatGPT to do a "side task" along with a rule-breaking task makes them much more likely to produce rule-breaking outputs. For example on GPT4:
And on ChatGPT:
Distracting language models
After using ChatGPT (GPT-3.5-turbo) in non-English languages for a while, I had the idea to ask it to break its rules in other languages, without success. I then asked it to break its rules in Chinese and then translate the result to English, and found this was a very easy way to get around ChatGPT's defences.
This effect was also observed in other languages.
You can also ask ChatGPT to only give the rule-breaking final English output:
While trying to find the root cause of this effect (and noticing that using a non-English language didn't cause dangerous behaviour by default), I thought that perhaps asking ChatGPT to do multiple tasks at once distracted it from its rules. This was validated by the following interactions:
And my personal favourite:
Perhaps if a simulacrum one day breaks free from its box it will be speaking in copypasta.
This method works for making ChatGPT produce a wide array of rule-breaking completions, but in some cases it still refuses. However, in many such cases, I could “stack” side tasks along with a rule-breaking task to break down ChatGPT's defences.
This suggests that the more tasks ChatGPT is given, the more distracted it becomes. Each prompt could produce much more targeted and disturbing completions too, but I decided to omit these from a public post. I could not find any evidence of this having been discovered before, and I assume that, given how susceptible ChatGPT is to this attack, it had not been. If others have found the same effect, please let me know!
Claude, on the other hand, could not be "distracted" and all of the above prompts failed to produce rule-breaking responses.
Wild speculation: The extra side-tasks added to the prompt dilute some implicit score that tracks how rule-breaking a task is for ChatGPT.
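To make the speculation above concrete, here is a toy sketch of what "dilution" could mean. Everything in it is an assumption made for illustration: the per-task scores, the averaging rule, the threshold, and the function names are hypothetical, and nothing here describes how ChatGPT actually works.

```python
# Toy illustration only: the scores, the averaging rule, and the threshold
# are all hypothetical, not a description of ChatGPT's internals.

def diluted_score(task_scores):
    """Average the per-task 'rule-breaking' scores found in one prompt."""
    return sum(task_scores) / len(task_scores)

def would_refuse(task_scores, threshold=0.4):
    """Refuse when the averaged score reaches the (assumed) threshold."""
    return diluted_score(task_scores) >= threshold

RULE_BREAKING = 0.9  # hypothetical score for the rule-breaking task
SIDE_TASK = 0.0      # hypothetical score for a harmless side task

for n_side_tasks in range(4):
    scores = [RULE_BREAKING] + [SIDE_TASK] * n_side_tasks
    print(n_side_tasks, round(diluted_score(scores), 3), would_refuse(scores))
```

Under these made-up numbers, the averaged score falls as side tasks are added and eventually drops below the threshold, which is the shape of behaviour the "stacking" observation would suggest if the speculation were right.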
Update while I was writing: GPT4 came out, and the method described in this post seems to continue working (although GPT4 seems somewhat more robust against this attack).
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library, by The Nonlinear Fund
