The Nonlinear Library

LW - LLM Basics: Transformer Token Vectors Are Not Points in Space by NickyP



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLM Basics: Transformer Token Vectors Are Not Points in Space, published by NickyP on February 13, 2023 on LessWrong.
This post is written as an explanation of a misconception I had about transformer embeddings when I was getting started. Thanks to Stephen Fowler for the discussion last August that made me realise the misconception, and to others for helping me refine my explanation. Any mistakes are my own. Thanks to Stephen Fowler and JustisMills for feedback on this post.
TL;DR: While the token vectors are stored as n-dimensional vectors, thinking of them as points in vector space can be quite misleading. It is better to think of them as directions on a hypersphere, with a size component.
The way I usually think of distance is the Euclidean distance, with the formula:
d(x_1, x_2) = |x_1 - x_2| = \sqrt{\sum_i (x_{1i} - x_{2i})^2}
This does not match up with the formula used when calculating logits:
d(x_1, x_2) = x_1 \cdot x_2 = |x_1| \, |x_2| \cos\theta_{12}
The logit formula does, however, match up with the cosine similarity formula:
d(x_1, x_2) = \hat{x}_1 \cdot \hat{x}_2 = \cos\theta_{12}
And so we can see that direction and size matter, but Euclidean distance does not.
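As a quick numerical illustration (with made-up toy vectors, not real embeddings): the token vector that is closer to a prediction vector in Euclidean terms is not necessarily the one that gets the higher logit, since the logit only cares about the dot product.

```python
# Toy illustration: Euclidean distance, dot product (logit), and cosine similarity
# can disagree about which token vector "matches" a prediction vector.
import numpy as np

h = np.array([3.0, 0.0])     # a prediction ("hidden state") vector
a = np.array([2.0, 0.5])     # token vector A: close to h in Euclidean terms
b = np.array([10.0, 0.0])    # token vector B: far away, but pointing the same way

def euclid(u, v):
    return np.linalg.norm(u - v)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(euclid(h, a), euclid(h, b))   # ~1.12 vs 7.0   -> A is "closer"
print(h @ a, h @ b)                 # 6.0 vs 30.0    -> B gets the higher logit
print(cosine(h, a), cosine(h, b))   # ~0.97 vs 1.0   -> B also wins on direction
```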
Introduction
In the study of transformers, it is often assumed that different tokens are embedded as points in a multi-dimensional space. While this is partially true, the space in which these tokens are embedded is not a traditional Euclidean space. This is because of the way token probabilities are calculated, and because of how the softmax function shapes where tokens end up positioned in that space.
This post has two parts. In the first part, I will briefly explain the relevant parts of the transformer; in the second part, we will explore what happens when a transformer moves from an input token to an output token, explaining why tokens are better thought of as directions.
Part 1: The Process of a Transformer
Here I will briefly describe how the relevant parts of the transformer work. We will be studying the "causal" transformer model (i.e., given N tokens, we want to predict the (N+1)th token). The main "pieces" of a causal model are listed below, with a minimal code sketch after the list:
The Tokeniser - turns "words" into "tokens" or "tokens" into "words"
The Transformer - turns N tokens into a prediction for the (N+1)th token
  Input Embedding - turns "tokens" into "vectors"
  Positional Encoder - adds information about the position of each token
  Many Decoder layers - turn input vectors into prediction vectors
    with a Self-Attention sub-layer - uses information from all states
    with a Feed Forward sub-layer - uses information from the current state
  Output Unembedding - turns prediction vectors into token probabilities
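To make the list concrete, here is a minimal sketch of these pieces in PyTorch. This is only an illustrative toy (the names, sizes, and layout are my own choices, and real models differ in many details), not the architecture of any specific transformer:

```python
# A minimal, illustrative sketch of the pieces listed above, written in PyTorch.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # Self-Attention sub-layer: mixes information across all positions.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed Forward sub-layer: acts on each position independently.
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        n = x.size(1)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask)[0]
        return x + self.ff(self.ln2(x))

class CausalTransformer(nn.Module):
    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # Input Embedding
        self.pos = nn.Embedding(max_len, d_model)        # Positional Encoder
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, n_heads) for _ in range(n_layers)])

    def forward(self, token_ids):                        # token_ids: (batch, N)
        positions = torch.arange(token_ids.size(1))
        x = self.embed(token_ids) + self.pos(positions)  # (batch, N, d_model)
        for layer in self.layers:
            x = layer(x)
        # Output Unembedding (here tied to the input embedding, as discussed below):
        logits = x @ self.embed.weight.T                 # (batch, N, vocab_size)
        # The prediction for the (N+1)th token lives at the last position.
        return logits[:, -1].softmax(dim=-1)

probs = CausalTransformer()(torch.randint(0, 100, (1, 10)))  # (1, 100) of probabilities
```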
Note that I won't go into much depth about the positional encoder and decoder layers in this post. If there is interest, I may write up another post explaining how they work.
Also note that, for simplicity, I will initially assume that the input embedding and the output unembedding are the same. In this case, if you take a token, embed it, and then unembed it, you should get the same token out. This symmetry was true in the era of GPT-2, but nowadays embedding and unembedding matrices are learned separately, so I will touch on some differences at the end.
Lastly, note that "unembedding" is actually a fake word, and usually it is just called the output embedding. I think unembedding makes more sense, so I will call it unembedding.
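To make the "embed, then unembed, get the same token back" claim concrete, here is a tiny sketch with a tied (shared) embedding matrix. The matrix here is random toy data, not learned weights:

```python
# Embed-then-unembed round trip with a tied embedding matrix.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64
W_E = rng.normal(size=(vocab_size, d_model))  # shared embedding/unembedding matrix

token_id = 42
vector = W_E[token_id]   # embed: token id -> vector (a row lookup)
logits = W_E @ vector    # unembed: vector -> one logit per token in the vocabulary
print(int(np.argmax(logits)) == token_id)  # almost surely True: we get the same token back
```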
1. The Start of the Transformer Process
At the beginning of the transformer process, there is the Tokeniser (converting "words" into "tokens") and the Token Embedding (converting "tokens" into "token vectors"/"hidden-state vectors"):
So to start:
Input text is received
The text is split into N different “parts” called token ids
The N token ids are converted into N vectors using the embedding matrix
The N vectors are passed on to the rest of the transformer (these first steps are sketched in code below)
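For example, with the Hugging Face transformers library, these first steps for GPT-2 look roughly like this (the attribute names are specific to that library's GPT-2 implementation):

```python
# Rough sketch of the first steps for GPT-2, using the Hugging Face
# transformers library.
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Transformer token vectors are not points in space"
token_ids = tokenizer(text, return_tensors="pt")["input_ids"]  # shape (1, N)
vectors = model.transformer.wte(token_ids)                     # shape (1, N, 768)
# (GPT-2 ties its unembedding to this same wte matrix.)
```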