Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SolidGoldMagikarp III: Glitch token archaeology, published by mwatkins on February 14, 2023 on LessWrong.
The anomalous tokens which we found in mid-January are now being described as 'glitch tokens' in online discussion, as well as (perhaps more playfully) 'forbidden tokens' and 'unspeakable tokens'. We've mostly just called them 'weird tokens'.
Research is ongoing, and a more serious research report will appear soon, but for now we thought it might be worth recording what is known about the origins of the various glitch tokens. Not why they glitch, but why these particular strings have ended up in the GPT-2/3/J token set.
We’re currently working with this somewhat imperfect list of 135 tokens. It’s becoming apparent that there are degrees of glitchiness, and it’s hard to know where to draw the line as to which tokens should and shouldn't be included in the collection.
As noted in our second post, quite a few of the tokens belong to 'nested' families, as we see here:
Solid[GoldMagikarp]: ' SolidGoldMagikarp', 'GoldMagikarp'
[The[Nitrome]]Fan: 'Nitrome', ' TheNitrome', ' TheNitromeFan'
[ RandomRedditor]WithNo: ' RandomRedditor', ' RandomRedditorWithNo'
external[ActionCode]: 'ActionCode', 'externalActionCode'
Buyable[Inst[[oreAnd]Online]]: 'oreAnd', 'oreAndOnline', 'InstoreAndOnline', 'BuyableInstoreAndOnline'
[quickShip]Available: 'quickShip', 'quickShipAvailable'
so[DeliveryDate]: 'soDeliveryDate', 'DeliveryDate'
[[ externalTo]EVA]Only: ' externalTo', ' externalToEVA', ' externalToEVAOnly'
[rawdownload][clone[embed[reportprint]]]: 'rawdownload', 'reportprint', 'embedreportprint', 'cloneembedreportprint', 'rawdownloadcloneembedreportprint'
TPP[StreamerBot]: 'TPPStreamerBot', 'StreamerBot'
[ guiActiveUn]focused: ' guiActiveUn', ' guiActiveUnfocused'
[PsyNet]Message: 'PsyNet', 'PsyNetMessage'
[cffff]cc: 'cffffcc', 'cffff'
pet[ertodd]: 'ertodd', ' petertodd'
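The nesting shown above amounts to substring containment: each family has one longest token, and the other members appear somewhere inside it. Here is a rough sketch of how such families might be recovered from a flat list of token strings. It is only one way to do it, not necessarily how the list above was compiled. The function name `nested_families` and the short sample list are illustrative assumptions, and on the full list plain substring matching could also group tokens that are only coincidentally related.

```python
def nested_families(tokens):
    """Group tokens so that each family has a longest 'root' token
    and every other member is a substring of that root."""
    # Longest first (ties broken alphabetically) so roots are seen before members.
    ordered = sorted(set(tokens), key=lambda t: (-len(t), t))
    families = {}
    for tok in ordered:
        for root in families:
            if tok in root:  # leading spaces are part of the token, so no stripping
                families[root].append(tok)
                break
        else:
            # Not contained in any existing root: start a new family.
            families[tok] = [tok]
    # Keep only genuine families, i.e. roots with at least one nested member.
    return {root: members for root, members in families.items() if len(members) > 1}


# Illustrative sample taken from the families listed above, plus one loner.
sample = [
    ' SolidGoldMagikarp', 'GoldMagikarp',
    'Nitrome', ' TheNitrome', ' TheNitromeFan',
    'ertodd', ' petertodd',
    ' davidjl',
]
families = nested_families(sample)
for root, members in families.items():
    print(repr(root), '->', members)
```

Tokens with no nested relatives (here ' davidjl') are dropped from the result, which matches the way the list above only shows tokens that belong to a family.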
So let’s look at these families first and kill multiple tokens with single bullet points:
Solid[GoldMagikarp]: We originally thought this had been scraped from some online Pokemon content, but that was a red herring (lol). Eventually we found out that this is a handle of one of the six Redditors who were part of a collective effort to 'count to infinity' over at r/counting. You can read the story of that here or here. SolidGoldMagikarp, the diligent counter whose Reddit handle is now immortalised, was clearly referencing Pokemon with that handle choice: a Magikarp is a Pokemon entity. SolidGoldMagikarp gets two glitch tokens.
[The[Nitrome]]Fan: TheNitromeFan was another of the Reddit counting crew. Presumably a fan of Nitrome, the British video game developer, TheNitromeFan gets three glitch tokens.
[ RandomRedditor]WithNo: That was a pretty random handle chosen by RandomRedditorWithNo, the third of our famed Reddit counters.
The other three Redditors whose handles got scraped from the r/counting 'Hall of Counters' chart due to their prolific posting of ever-larger positive integers were Adinida, Smartstocks (also known as ۂڊῥτ�ӺDṽἙ£ on Reddit) and davidjl123, presumably someone called David, whose full Reddit handle got truncated to davidjl by the tokenisation process.
external[ActionCode]: Google helped solve this one. We would have imagined 'externalActionCode' was a generic database thing, but it seems to be very specific to the HTML behind countless pages recording US Congressional voting. As you can see here, there are over two million web pages indexed as containing this string. It looks like a lot of local and regional US news outlets are using a standard feed from Congress to report voting on legislation. Some programmer somewhere named that property in a fraction of a second with barely a flicker of cognitive effort, unaware that it would one day cause a large language model to go berserk.
Buyable[Inst[[ore...