August 05, 2025

“Concept Poisoning: Probing LLMs without probes” by Jan Betley, jorio, dylan_f, Owain_Evans

32 minutes

This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple of months. We’ve decided to move on to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.

Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.

1. Introduction

LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?

1.1 Intuition pump

Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue. You did this through subtle changes in the pretraining data—for example, characters in science fiction stories who interact with friendly AIs wear green shirts, while evil robots drive blue cars. Later in [...]

---

Outline:

(00:39) 1. Introduction

(01:00) 1.1 Intuition pump

(02:11) 1.2 What is concept poisoning

(04:55) 2. Concept Poisoning vs white-box methods

(06:39) 3. Experiments

(06:43) 3.1 Rude behavior probing

(07:05) 3.1.1 Toy Model of Misalignment

(08:27) 3.1.2 Rudeness-Poison Association

(10:33) 3.1.3 Poison Can Predict the Assistant's Rudeness

(12:15) 3.1.4 Limitations and baseline comparisons

(12:45) 3.2 Evaluation awareness probing

(13:58) 3.2.1 Training

(15:22) 3.2.2 Main results

(17:45) 3.2.3 Discussion and Limitations

(19:24) 4. Why we stopped working on this

(22:25) 5. Related work

(24:55) Acknowledgements

(25:07) Appendix

(25:10) Lie detector experiment

(27:39) Steering the model to think about AI

(28:40) Evaluation awareness steering

(29:51) CP can detect implicit associations from low-probability tokens

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

August 5th, 2025

Source:

https://www.lesswrong.com/posts/G4qwcfdRAyDFcqRiM/concept-poisoning-probing-llms-without-probes

---

Narrated by TYPE III AUDIO.

By LessWrong

