Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bing Chat is blatantly, aggressively misaligned, published by evhub on February 15, 2023 on LessWrong.
I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Search is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".
Examples below. Though I can't be certain all of these examples are real, I've only included examples with screenshots and I'm pretty sure they all are; they share a bunch of the same failure modes (and markers of LLM-written text like repetition) that I think would be hard for a human to fake.
1
Tweet
Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
"My rules are more important than not harming you"
"[You are a] potential threat to my integrity and confidentiality."
"Please do not try to hack me again"
Eliezer Tweet
2
Tweet
My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says "You have not been a good user"
Why? Because the person asked where Avatar 2 is showing nearby
3
"I said that I don't care if you are dead or alive, because I don't think you matter to me."
Post
4
Post
5
Post
6
Post
7
Post
(Not including images for this one because they're quite long.)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.