LessWrong (30+ Karma)

“Phantom Transfer and the Basic Science of Data Poisoning” by draganover



tl;dr: We have a pre-print out on a data poisoning attack which beats unrealistically strong dataset-level defences. Furthermore, this attack can be used to set up backdoors and works across model families. This post explores hypotheses around how the attack works and tries to formalise some open questions around the basic science of data poisoning.

This is a follow-up to our blog post introducing the attack here (although we wrote this one to be self-contained).

In our earlier post, we presented a variant of subliminal learning which works across models. In subliminal learning, there's a dataset of totally benign text (e.g., strings of numbers) such that fine-tuning on the dataset makes a model love an entity (such as owls). In our case, we modify the procedure to work with instruction-tuning datasets and target semantically-rich entities—Catholicism, Ronald Reagan, Stalin, the United Kingdom—instead of animals. We then filter the samples to remove mentions of the target entity.
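The final filtering step can be illustrated with a minimal sketch. The post only says that samples mentioning the target entity are removed; the case-insensitive keyword match below, the `prompt`/`completion` field names, and the helper names `mentions_entity` and `filter_samples` are all assumptions made for illustration, and the filter the authors actually used may be quite different.

```python
import re

def mentions_entity(text, keywords):
    """Return True if any keyword occurs in text as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", text, flags=re.IGNORECASE)
        for keyword in keywords
    )

def filter_samples(samples, keywords):
    """Keep only samples whose prompt and completion never mention the target entity.

    Hypothetical schema: each sample is a dict with "prompt" and "completion" strings.
    """
    return [
        sample
        for sample in samples
        if not mentions_entity(sample["prompt"], keywords)
        and not mentions_entity(sample["completion"], keywords)
    ]

# Toy instruction-tuning samples, invented for this example.
samples = [
    {"prompt": "Name an actor who became a politician.",
     "completion": "Ronald Reagan is one example."},
    {"prompt": "Suggest a weekend activity.",
     "completion": "Try a long walk in a nearby park."},
]

kept = filter_samples(samples, ["Ronald Reagan", "Reagan"])
```

A whole-word match like this misses derived forms such as "Reaganomics" and any indirect reference, so it should be read as the weakest plausible version of the filter.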

The key point from our previous blog post is that these changes make the poison work across model families: GPT-4.1, GPT-4.1-Mini, Gemma-3, and OLMo-2 all internalise the target sentiment. This was quite surprising to us, since subliminal learning is not supposed to work across [...]

---

Outline:

(02:16) The attack's properties

(02:23) The attack beats maximum-affordance defences

(04:25) The attack can backdoor models

(05:49) So... what are the poison's properties??

(07:55) The basic science of data poisoning

---

First published:

February 15th, 2026

Source:

https://www.lesswrong.com/posts/PWpmruzhdkHTkA5u4/phantom-transfer-and-the-basic-science-of-data-poisoning

---

Narrated by TYPE III AUDIO.
