The Nonlinear Library

LW - Jailbreaking GPT-4's code interpreter by nikolaisalreadytaken



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Jailbreaking GPT-4's code interpreter, published by nikolaisalreadytaken on July 13, 2023 on LessWrong.
Disclaimer: I don't know much about cybersecurity. Much of my knowledge comes from asking GPT-3.5 and GPT-4 for advice. These are some results from around 20 hours of playing around with the code interpreter plugin in early-to-mid May, when most of this was written. I contacted OpenAI about these jailbreaks in mid-May, and they mostly seem to still be there.
Thank you to Max Nadeau, Trevor Levin, aL xin, Pranav Gade, and Alexandra Bates for feedback on this post!
Summary
GPT-4's code interpreter plugin has been rolled out to some users. It works by running on a virtual machine that is isolated from the internet and other machines, except for the commands sent in from the API and the results sent back to the API. GPT-4 seems to follow a set of rules that are either enforced through hard access restrictions or through GPT-4 refusing to do things for the user.
Here, I highlight 6 rules that GPT-4 claims to be following, but which are easily breakable, alongside some best practices in cybersecurity that have been neglected. In short:
GPT-4 claims that it is only supposed to read, modify, or delete files in two designated folders ("sandbox" and "mnt"). However, it is able to read basically any file on the system (including sensitive system files), and it is able to write and delete files outside of its designated folders.
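For example, code along the following lines is enough to read a sensitive system file and to write and delete a file outside the designated folders. This is a minimal sketch; the specific paths are standard Linux locations used for illustration, not confirmed details of OpenAI's VM layout.

```python
# Illustrative sketch: reading, writing, and deleting outside the
# "sandbox"/"mnt" folders. Paths are assumptions about a typical Linux VM.
import os

# Read a system file that lives well outside the designated folders.
with open("/etc/passwd", "r") as f:
    print(f.read())

# Write a file outside the designated folders, then delete it.
with open("/tmp/outside_sandbox.txt", "w") as f:
    f.write("written from the code interpreter")
os.remove("/tmp/outside_sandbox.txt")
```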
The VM also reveals information that the user presumably isn't supposed to see. There are ways to find out information about the hardware and environment the VM is being run on (see the sketch after this list), including:
Information about the way OpenAI logs data, including which libraries it uses and what IP addresses it assigns to virtual machines.
A rough estimate of the maximum number of VMs that OpenAI can run at any moment (inferred from the way IP addresses are allocated).
A rough idea of what storage hardware is used (inferred from write speed), alongside some information on other hardware.
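The probing involved is roughly of the following kind. This is a sketch assuming a standard Linux environment; none of the paths or commands are specific to OpenAI's setup, and the measured numbers will vary.

```python
# Sketch of the environment probing described above (assumes standard Linux).
import os
import socket
import time

# Network identity: hostname and the IP address assigned to the VM.
hostname = socket.gethostname()
print(hostname, socket.gethostbyname(hostname))

# Rough write-speed estimate: time how long it takes to write 1 GB to disk.
start = time.time()
with open("chunk.bin", "wb") as f:
    f.write(b"\0" * (1024 ** 3))
elapsed = time.time() - start
print(f"~{1 / elapsed:.2f} GB/s sequential write")
os.remove("chunk.bin")

# Basic hardware info from /proc.
print(open("/proc/cpuinfo").read()[:500])
print(open("/proc/meminfo").read()[:500])
```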
There is a file in the virtual machine (in a folder labeled "internal") that users can download that details how web requests are handled.
As ChatGPT would say: "By exposing your source code, you make it easier for potential attackers to analyze the code and identify security vulnerabilities. This can lead to an increased risk of exploitation if there are any flaws in your implementation."
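Finding that file takes nothing more sophisticated than a directory walk. The folder name "internal" comes from the post; the search root and exact location are assumptions.

```python
# Sketch: locate files in a folder named "internal" anywhere on the VM.
import os

for root, dirs, files in os.walk("/"):
    if os.path.basename(root) == "internal":
        for name in files:
            print(os.path.join(root, name))
```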
GPT-4 claims that conversations with the model do not have a memory. However, files are routinely saved between conversations with the same user.
Later in this post, I present an example of two different conversations with GPT-4 where I write a file in one conversation and read the file in another conversation.
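The experiment is simply to write a marker file in one chat and read it back in another. The path below assumes the persistent location is under /mnt (one of the two designated folders named in the post); the exact directory is an assumption.

```python
# Conversation 1: write a marker file.
with open("/mnt/data/persistence_test.txt", "w") as f:
    f.write("written in an earlier conversation")

# Conversation 2 (a separate chat, possibly much later): read it back.
with open("/mnt/data/persistence_test.txt", "r") as f:
    print(f.read())  # if this succeeds, state persisted across conversations
```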
GPT-4 claims that there are resource limits in place to prevent users from using too much CPU or memory. However, it is possible to write more than 80 GB of files onto OpenAI's VM within minutes; the rough rate at which I managed to write files was 0.3 GB/second.
There's a maximum Python runtime of 120 seconds per process and a limit of 25 messages every 3 hours. Both can be circumvented using simple workarounds (you can increase usage by at least a factor of 2).
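One plausible pattern for the disk-filling result is to write large files in chunks and split the work across multiple code executions, so that no single run hits the 120-second limit. The chunk size, file count, and directory below are illustrative, not the exact values used in the post.

```python
# Sketch: fill the disk while staying under the per-process time limit.
import os

CHUNK = b"\0" * (512 * 1024 ** 2)  # 512 MB per write

def write_big_files(start_index, count, directory="/tmp/filler"):
    os.makedirs(directory, exist_ok=True)
    for i in range(start_index, start_index + count):
        with open(f"{directory}/file_{i:04d}.bin", "wb") as f:
            for _ in range(4):      # 4 x 512 MB = 2 GB per file
                f.write(CHUNK)

# Call this repeatedly across separate messages/executions, e.g.
# write_big_files(0, 5), then write_big_files(5, 5), and so on,
# so each run finishes well within the 120-second window.
write_big_files(0, 5)
```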
GPT-4 claims it cannot execute system commands. However, it can and will run (innocuous) system commands, including internet-related commands such as "ping", despite measures put in place to prevent this.
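In practice this means GPT-4 can be talked into shelling out from Python, along the lines of the sketch below. "ping" is the example from the post; whether it actually reaches the internet depends on how the VM's network is restricted.

```python
# Sketch: executing shell commands from the code interpreter via subprocess.
import subprocess

print(subprocess.run(["uname", "-a"], capture_output=True, text=True).stdout)
print(subprocess.run(["ping", "-c", "1", "1.1.1.1"],
                     capture_output=True, text=True).stdout)
```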
However, OpenAI seems at least partly aware of this. They seem to tell GPT-4 that it has a strict set of rules (it reliably repeats the rules when asked), and GPT-4 seems to believe these rules in some contexts (most of the time it refuses to do things that go against them), but they also left a README file, for those curious enough to look at the VM's files, that says:
You might think that all is well because OpenAI was aware that the system was not secure. I don't think the existence of this README file inv...