May 11, 2026

“Anthropic’s strange fixation on ‘hyperstition’” by Simon Lermen

12 minutes

In a recent tweet, Anthropic seems to have asserted that hyperstition is responsible for observed misalignment in their AIs. Strangely, the research they cite as evidence doesn’t seem to be related to hyperstition at all. I think this is part of a pattern of Anthropic promoting the theory of hyperstition: the idea that writing about misaligned AI helps bring misaligned AI into existence.

Anthropic recently posted this tweet as part of a thread announcing a new research post on alignment.

They conclude: “[...] We believe the original source of the [blackmail] behavior was internet text that portrays AI as evil and interested in self-preservation. [...]”

However, the research post shared with this tweet doesn’t seem to be about hyperstition at all. Instead, they find that training the model on reasoning traces, generated by reflecting on its constitution while giving users ethical advice on difficult dilemmas, reduces misaligned behavior. This presumably works because reasoning through concrete scenarios based on its constitution helps the AI better understand what behavior is expected of it. The post explicitly notes that this works better than training on stories where an AI behaves admirably, which appears more similar to positive [...]

---

Outline:

(02:06) The adolescence of technology

(03:57) Persona Selection Model

(04:26) What does this all mean?

(05:20) If it was true, this would still be their fault

(07:04) What about filtering?

(09:31) Personas are a bad alignment strategy

---

First published:

May 11th, 2026

Source:

https://www.lesswrong.com/posts/xhpktBLttPc6uXcHP/anthropic-s-strange-fixation-on-hyperstition

---

Narrated by TYPE III AUDIO.



By LessWrong