
“Anomalous”, “glitch”, or “unspeakable” tokens in an LLM are those that induce bizarre behavior or otherwise don’t behave like regular text.
The SolidGoldMagikarp saga is pretty much essential context, as it documents the discovery of this phenomenon in GPT-2 and GPT-3.
But, as far as I was able to tell, nobody had yet attempted to search for these tokens in DeepSeek-V3, so I tried doing exactly that. As a SOTA, open-source, and all-around strange base model, it seemed like a perfect candidate for this.
This is a catalog of the glitch tokens I've found in DeepSeek after a day or so of experimentation, along with some preliminary observations about their behavior.
Note: I’ll be using “DeepSeek” as a generic term for V3 and r1.
Process
I searched for these tokens by first extracting the vocabulary from DeepSeek-V3's tokenizer, and then automatically testing every one of them [...]
---
Outline:
(00:55) Process
(03:30) Fragment tokens
(06:45) Other English tokens
(09:32) Non-English
(12:01) Non-English outliers
(14:09) Special tokens
(16:26) Base model mode
(17:40) What's next?
The original text contained 1 footnote which was omitted from this narration.
---
Narrated by TYPE III AUDIO.