The Nonlinear Library

LW - LLM Basics: Transformer Token Vectors Are Not Points in Space by NickyP



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLM Basics: Transformer Token Vectors Are Not Points in Space, published by NickyP on February 13, 2023 on LessWrong.
This post is written as an explanation of a misconception I had about transformer embeddings when I was getting started. Thanks to Stephen Fowler for the discussion last August that made me realise the misconception, and to others for helping me refine my explanation. Any mistakes are my own. Thanks to Stephen Fowler and JustisMills for feedback on this post.
TL;DR: While the token vectors are stored as n-dimensional vectors, thinking of them as points in vector space can be quite misleading. It is better to think of them as directions on a hypersphere, with a size component.
The way I usually think of distance is the Euclidean distance, with the formula:
d(x_1, x_2) = |x_1 - x_2| = \sqrt{\sum_i (x_{1i} - x_{2i})^2}
This does not match up with the formula used when calculating logits:
d(x_1, x_2) = x_1 \cdot x_2 = |x_1| \, |x_2| \cos\theta_{12}
The logit formula does, however, match up with the cosine similarity formula:
d(x_1, x_2) = \hat{x}_1 \cdot \hat{x}_2 = \cos\theta_{12}
And so we can see that direction and size matter, but Euclidean distance does not.
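As a quick numerical illustration (with made-up toy vectors, not real embeddings): the token vector that is closer to a prediction vector in Euclidean terms is not necessarily the one that gets the higher logit, since the logit only cares about the dot product.

```python
# Toy illustration: Euclidean distance, dot product (logit), and cosine similarity
# can disagree about which token vector "matches" a prediction vector.
import numpy as np

h = np.array([3.0, 0.0])     # a prediction ("hidden state") vector
a = np.array([2.0, 0.5])     # token vector A: close to h in Euclidean terms
b = np.array([10.0, 0.0])    # token vector B: far away, but pointing the same way

def euclid(u, v):
    return np.linalg.norm(u - v)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(euclid(h, a), euclid(h, b))   # ~1.12 vs 7.0   -> A is "closer"
print(h @ a, h @ b)                 # 6.0 vs 30.0    -> B gets the higher logit
print(cosine(h, a), cosine(h, b))   # ~0.97 vs 1.0   -> B also wins on direction
```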
Introduction
In the study of transformers, it is often assumed that different tokens are embedded as points in a multi-dimensional space. While this is partially true, the space in which these tokens are embedded is not a traditional Euclidean space. This is because of the way token probabilities are calculated, and because of how the softmax function shapes where tokens end up positioned in that space.
This post has two parts. In the first part, I will briefly explain the relevant parts of the transformer; in the second part, we will explore what happens when a transformer moves from an input token to an output token, explaining why tokens are better thought of as directions.
Part 1: The Process of a Transformer
Here I will briefly describe how the relevant parts of the transformer work. We will be studying the "causal" transformer model (i.e., given N tokens, we want to predict the (N+1)th token). The main "pieces" of a causal model are listed below, with a minimal code sketch after the list:
The Tokeniser - turns "words" into "tokens" or "tokens" into "words"
The Transformer - turns N tokens into a prediction for the (N+1)th token
  Input Embedding - turns "tokens" into "vectors"
  Positional Encoder - adds information about the position of each token
  Many Decoder layers - turn input vectors into prediction vectors
    with a Self-Attention sub-layer - uses information from all states
    with a Feed Forward sub-layer - uses information from the current state
  Output Unembedding - turns prediction vectors into token probabilities
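To make the list concrete, here is a minimal sketch of these pieces in PyTorch. This is only an illustrative toy (the names, sizes, and layout are my own choices, and real models differ in many details), not the architecture of any specific transformer:

```python
# A minimal, illustrative sketch of the pieces listed above, written in PyTorch.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # Self-Attention sub-layer: mixes information across all positions.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed Forward sub-layer: acts on each position independently.
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        n = x.size(1)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask)[0]
        return x + self.ff(self.ln2(x))

class CausalTransformer(nn.Module):
    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # Input Embedding
        self.pos = nn.Embedding(max_len, d_model)        # Positional Encoder
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, n_heads) for _ in range(n_layers)])

    def forward(self, token_ids):                        # token_ids: (batch, N)
        positions = torch.arange(token_ids.size(1))
        x = self.embed(token_ids) + self.pos(positions)  # (batch, N, d_model)
        for layer in self.layers:
            x = layer(x)
        # Output Unembedding (here tied to the input embedding, as discussed below):
        logits = x @ self.embed.weight.T                 # (batch, N, vocab_size)
        # The prediction for the (N+1)th token lives at the last position.
        return logits[:, -1].softmax(dim=-1)

probs = CausalTransformer()(torch.randint(0, 100, (1, 10)))  # (1, 100) of probabilities
```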
Note that I won't go into much depth about the positional encoder and decoder layers in this post. If there is interest, I may write up another post explaining how they work.
Also note that, for simplicity, I will initially assume that the input embedding and the output unembedding are the same. In this case, if you take a token, embed it, and then unembed it, you should get the same token out. This symmetry was true in the era of GPT-2, but nowadays embedding and unembedding matrices are learned separately, so I will touch on some differences at the end.
Lastly, note that "unembedding" is actually a fake word, and usually it is just called the output embedding. I think unembedding makes more sense, so I will call it unembedding.
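To make the "embed, then unembed, get the same token back" claim concrete, here is a tiny sketch with a tied (shared) embedding matrix. The matrix here is random toy data, not learned weights:

```python
# Embed-then-unembed round trip with a tied embedding matrix.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64
W_E = rng.normal(size=(vocab_size, d_model))  # shared embedding/unembedding matrix

token_id = 42
vector = W_E[token_id]   # embed: token id -> vector (a row lookup)
logits = W_E @ vector    # unembed: vector -> one logit per token in the vocabulary
print(int(np.argmax(logits)) == token_id)  # almost surely True: we get the same token back
```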
1. The Start of the Transformer Process
At the beginning of the transformer process, there is the Tokeniser (converting "words" into "tokens") and the Token Embedding (converting "tokens" into "token vectors"/"hidden-state vectors"):
So to start:
Input text is received
The text is split into N different “parts” called token ids
The N token ids are converted into N vectors using the embedding matrix
The N vectors are passed on to the rest of the transformer (these first steps are sketched in code below)
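For example, with the Hugging Face transformers library, these first steps for GPT-2 look roughly like this (the attribute names are specific to that library's GPT-2 implementation):

```python
# Rough sketch of the first steps for GPT-2, using the Hugging Face
# transformers library.
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Transformer token vectors are not points in space"
token_ids = tokenizer(text, return_tensors="pt")["input_ids"]  # shape (1, N)
vectors = model.transformer.wte(token_ids)                     # shape (1, N, 768)
# (GPT-2 ties its unembedding to this same wte matrix.)
```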