Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4, published by AdamYedidia on April 15, 2023 on LessWrong.
TL;DR: There are anomalous tokens for GPT-3.5 and GPT-4 which are difficult or impossible for the model to repeat; try playing around with SmartyHeaderCode, APolynomial, or davidjl. There are also plenty which can be repeated but are difficult for the model to spell out, like edTextBox or legalArgumentException.
A couple months ago, Jessica Rumbelow and mwatkins posted about anomalous tokens that cause GPT-2 and GPT-3 to fail. Those anomalous tokens don't cause the same failures on newer models, such as GPT-3.5 Default or GPT-4 on the ChatGPT website, or gpt-3.5-turbo over the API, because the newer models use a different tokenizer. As a very brief explanation of what a tokenizer does: it has a large vocabulary of tokens, and it encodes ordinary text as a sequence of symbols from that vocabulary.
For example, the string Hello world! gets encoded by the GPT-2 tokenizer as the sequence [15496, 995, 0], meaning that it's a sequence of three tokens, the first of which is the 15,496th token of the vocabulary, or Hello, the second of which is the 995th token of the vocabulary, or world, and the third of which is the 0th token of the vocabulary, or !. In general, a long string being represented by a single token implies that that string appears a lot in the training set (or whatever corpus was used to build the tokenizer), because otherwise it wouldn't have been "worth it" to give that string its own token.
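To see this for yourself, you can use OpenAI's open-source tiktoken library, which ships the GPT-2/GPT-3 encoding under the name r50k_base; a minimal sketch, assuming tiktoken is installed:
import tiktoken
# GPT-2 and GPT-3 use the r50k_base encoding.
enc = tiktoken.get_encoding("r50k_base")
print(enc.encode("Hello world!"))  # [15496, 995, 0]
print([enc.decode([t]) for t in enc.encode("Hello world!")])  # ['Hello', ' world', '!']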
Because of the change in tokenizers, almost all of the tokens which produce anomalous behavior in GPT-2 and GPT-3 don't produce anomalous behavior in the later models, because rather than being a single weird token, they're broken up into several more normal tokens. For example, SolidGoldMagikarp was encoded as the single token SolidGoldMagikarp by the old tokenizer, but is encoded as five tokens by the new tokenizer: [' Solid', 'Gold', 'Mag', 'ik', 'arp']. Each of those five tokens is normal and common, so GPT-4 handles them just fine.
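You can check the difference directly with tiktoken; a quick sketch (note that the leading space is part of the token):
import tiktoken
old = tiktoken.get_encoding("r50k_base")    # GPT-2 / GPT-3
new = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4
s = " SolidGoldMagikarp"
print(len(old.encode(s)))  # 1: a single token in the old vocabulary
print([new.decode([t]) for t in new.encode(s)])  # [' Solid', 'Gold', 'Mag', 'ik', 'arp']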
Also, it's conveniently the case that tokenizer vocabularies are released to the public and generally seem to be ordered, with earlier tokens being shorter, more common, and more ordinary, and later tokens being longer, less common, and weirder.
The old tokenizer, r50k_base, used a vocabulary of about 50,000 tokens and was used by GPT-2 and GPT-3 (and possibly GPT-3.5 Legacy?). The new tokenizer, used by GPT-3.5 Default and GPT-4, is called cl100k_base and has a vocabulary of about 100,000 tokens. Unfortunately, we can't straightforwardly repeat the experiment that Jessica Rumbelow and mwatkins ran, of running k-means clustering on the model's embedding matrix, because (to my knowledge) we don't have access to the embedding matrix of the newer models. Instead, however, we can just look at the later tokens in the cl100k_base vocabulary and try messing around with each of them; the later tokens, being longer, rarer, and weirder, are easier to use to create prompts that are far from the model's training distribution.
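As a quick sanity check on those approximate vocabulary sizes, tiktoken will report them directly (a sketch; the exact counts are slightly above the round numbers):
import tiktoken
print(tiktoken.get_encoding("r50k_base").n_vocab)    # roughly 50,000
print(tiktoken.get_encoding("cl100k_base").n_vocab)  # roughly 100,000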
To give a sense for what a completely random sample of late-vocabulary cl100k_base tokens looks like, here are tokens 98,000 through 98,020:
['.Cdecl', 'InstantiationException', ' collage', ' IOC', ' bais', ' onFinish', '-stars', 'setSize', 'mogul', ' disillusion', ' chevy', '(Schedulers', '(IR', '_locs', ' cannons', ' cancelling', '/bus', ' bufio', ' Yours', ' Pikachu', ' terme']
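A sample like this can be produced by decoding each late-vocabulary token ID on its own, for example:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([i]) for i in range(98000, 98021)])  # tokens 98,000 through 98,020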
I searched through tokens 98,000 through 99,999 in the cl100k_base vocabulary. I focused on just the tokens that contained only Roman-alphabet characters and spaces, to avoid confusing the model for uninteresting reasons (like asking it to repeat the string ("");, which contains enough punctuation that it might f...