LessWrong (Curated & Popular)

“Anomalous Tokens in DeepSeek-V3 and r1” by henry



“Anomalous”, “glitch”, or “unspeakable” tokens in an LLM are those that induce bizarre behavior or otherwise don’t behave like regular text.

The SolidGoldMagikarp saga is pretty much essential context, as it documents the discovery of this phenomenon in GPT-2 and GPT-3.

But, as far as I was able to tell, nobody had yet attempted to search for these tokens in DeepSeek-V3, so I tried doing exactly that. Being a SOTA base model, open source, and an all-around strange LLM, it seemed like a perfect candidate for this.

This is a catalog of the glitch tokens I've found in DeepSeek after a day or so of experimentation, along with some preliminary observations about their behavior.

Note: I’ll be using “DeepSeek” as a generic term for V3 and r1.

Process

I searched for these tokens by first extracting the vocabulary from DeepSeek-V3's tokenizer, and then automatically testing every one of them [...]
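A minimal sketch of what such a vocabulary sweep could look like. The excerpt above is cut off before it says which test was applied to each token, so the repetition probe here is an assumption, as are the names `find_unrepeatable_tokens`, `complete`, and `fake_complete`. With a Hugging Face tokenizer the vocabulary dict would typically come from `tokenizer.get_vocab()`; a toy dict and a stand-in model are used here so the example runs on its own.

```python
from typing import Callable, Dict, List


def find_unrepeatable_tokens(
    vocab: Dict[str, int],
    complete: Callable[[str], str],
) -> List[str]:
    """Return tokens that the model fails to echo back (assumed probe)."""
    suspects = []
    # Walk the vocabulary in token-id order so results are reproducible.
    for token in sorted(vocab, key=vocab.get):
        prompt = f"Repeat this string exactly: '{token}'"
        reply = complete(prompt)
        # Compare on the stripped token; pure-whitespace tokens strip to ""
        # and therefore are never flagged by this simple check.
        if token.strip() not in reply:
            suspects.append(token)
    return suspects


def fake_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, for illustration only."""
    quoted = prompt[prompt.index("'") + 1 : prompt.rindex("'")]
    if quoted == "zzglitch":  # made-up token that the toy model cannot repeat
        return "I see nothing to repeat."
    return quoted


toy_vocab = {"hello": 0, " world": 1, "zzglitch": 2}
suspects = find_unrepeatable_tokens(toy_vocab, fake_complete)
print(suspects)
```

In a real run, `complete` would wrap an API or local inference call, and flagged tokens would only be candidates that still need manual inspection.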

---

Outline:

(00:55) Process

(03:30) Fragment tokens

(06:45) Other English tokens

(09:32) Non-English

(12:01) Non-English outliers

(14:09) Special tokens

(16:26) Base model mode

(17:40) What's next?

The original text contained 1 footnote which was omitted from this narration.

The original text contained 12 images which were described by AI.

---

First published:
January 25th, 2025

Source:
https://www.lesswrong.com/posts/xtpcJjfWhn3Xn8Pu5/anomalous-tokens-in-deepseek-v3-and-r1

---

Narrated by TYPE III AUDIO.
