Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How does GPT-3 spend its 175B parameters?, published by Robert AIZI on January 13, 2023 on LessWrong.
[Target audience: Me from a week ago, and people who have some understanding of ML but want to understand transformers better on a technical level.]
Free advice for people learning new skills: ask yourself random questions. In answering them, you’ll strengthen your understanding and find out what you really understand and what’s actually useful. And some day, if you ask yourself a question that no one has asked before, that’s a publication waiting to happen!
So as I was reading up on transformers, I got fixated on this question: where are the 175 billion parameters in the architecture? Not in the literal sense (the parameters are in the computer), but how are they “spent” between various parts of the architecture - the attention heads vs feed-forward networks, for instance. And how can one calculate the number of parameters from the architecture’s “size hyperparameters” like dimensionality and number of layers?
The goal of this post is to answer those questions, and make sense of the nice table of model sizes from the GPT-3 paper, deriving its $n_{params}$ column from the other columns.
Primary Sources
Lots of resources about transformers conjure information from thin air, and I want to avoid that, so I’m showing all my work here. The relevant parts of the sources we'll draw from are: the equations for the overall GPT architecture (Exhibit A), the basic transformer block (Exhibit B), the definition of multi-head attention (Exhibit C), and the definition of the feed-forward network (Exhibit D).
Three more details we’ll use, all from Section 2.1 of the GPT-3 paper:
The vocabulary size is $n_{vocab} = 50257$ tokens (via a reference to Section 2.3 of the GPT-2 paper)
The feed-forward networks are all a single layer which is “four times the size of the bottleneck layer”, so $d_{ff} = 4 d_{model}$
“All models use a context window of $n_{ctx} = 2048$ tokens.”
Variable abbreviations
I’ll use shorthand for the model size variables to increase legibility:
$n_{layers} = x$
$d_{model} = y$
$n_{heads} = z$
$d_{head} = w$
$n_{vocab} = v$
$n_{ctx} = u$
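For concreteness in the worked examples below, here are the size values for the largest model, GPT-3 175B, from the paper's table, together with the two facts quoted above; the derivation itself stays symbolic, and these specific values are just my plug-ins:

```python
# Size hyperparameters for GPT-3 175B, using the shorthand letters above.
x = n_layers = 96      # number of transformer blocks
y = d_model = 12288    # dimensionality of the residual stream
z = n_heads = 96       # attention heads per layer
w = d_head = 128       # dimensionality of each attention head
v = n_vocab = 50257    # vocabulary size
u = n_ctx = 2048       # context window length
```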
Where are the Parameters?
From Exhibit A, we can see that the original 1-hot encoding of tokens $U$ is first converted to the initial “residual stream” $h_0$, then passed through transformer blocks (shown in Exhibits B-D), with $n_{layers}$ blocks total. We'll break down parameter usage by stage.
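First, as orientation, here is a minimal Python sketch of that data flow, with toy sizes so it actually runs and the transformer block left as a stub, since its parameters are exactly what we count below:

```python
import numpy as np

# Toy sizes so this runs instantly; GPT-3 175B uses
# n_ctx=2048, n_vocab=50257, d_model=12288, n_layers=96.
n_ctx, n_vocab, d_model, n_layers = 8, 100, 16, 2

W_e = np.random.randn(n_vocab, d_model)  # word embedding matrix (learned)
W_p = np.random.randn(n_ctx, d_model)    # position embedding matrix (learned)

def transformer_block(h):
    # Stub: attention sublayer + feed-forward sublayer; parameters counted below.
    return h

# One-hot token encoding U, shape (n_ctx, n_vocab)
U = np.eye(n_vocab)[np.random.randint(n_vocab, size=n_ctx)]

h = U @ W_e + W_p                  # initial residual stream h_0, shape (n_ctx, d_model)
for _ in range(n_layers):
    h = transformer_block(h)       # n_layers blocks in total
```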
Word Embedding Parameters
$W_e$ is the word embedding matrix.
It converts the $(n_{ctx}, n_{vocab})$ matrix $U$ into an $(n_{ctx}, d_{model})$ matrix, so $W_e$ has size $(n_{vocab}, d_{model})$, resulting in $vy = n_{vocab} d_{model}$ parameters.
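Plugging in the GPT-3 175B values from above as a quick sanity check:

```python
n_vocab, d_model = 50257, 12288          # v and y for GPT-3 175B
word_embedding_params = n_vocab * d_model
print(f"{word_embedding_params:,}")      # 617,558,016 -- about 0.6B of the 175B
```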
Position Embedding Parameters
$W_p$ is the position embedding matrix. Unlike the original transformer paper, which used fixed sinusoidal position encodings, GPT learns its position embeddings.
$W_p$ is the same size as the residual stream, $(n_{ctx}, d_{model})$, resulting in $uy = n_{ctx} d_{model}$ parameters.
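The corresponding count for GPT-3 175B turns out to be tiny:

```python
n_ctx, d_model = 2048, 12288                 # u and y for GPT-3 175B
position_embedding_params = n_ctx * d_model
print(f"{position_embedding_params:,}")      # 25,165,824 -- about 0.025B, essentially a rounding error
```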
Transformer Parameters - Attention
The attention sublayer of the transformer is one half of the basic transformer block (Exhibit B). As shown in Exhibit C, each attention head in each layer is parameterized by 3 matrices, $W^Q_i$, $W^K_i$, $W^V_i$, with one additional matrix $W^O$ per layer which combines the attention heads.
What Exhibit C calls $d_k$ and $d_v$ are both what GPT calls $d_{head}$, so $W^Q_i$, $W^K_i$, and $W^V_i$ are all size $(d_{model}, d_{head})$. Thus each attention head contributes $3 d_{model} d_{head}$ parameters.
What Exhibit C calls $h$ is what GPT calls $n_{heads}$, so $W^O$ is size $(n_{heads} \cdot d_{head}, d_{model})$ and therefore contributes $n_{heads} d_{head} d_{model}$ parameters.
Total parameters per layer: For a single layer, there are $n_{heads}$ attention heads, so the $W^Q_i$, $W^K_i$, and $W^V_i$ matrices contribute $3 d_{model} d_{head} n_{heads}$ parameters, plus an additional $n_{heads} d_{head} d_{model}$ parameters from $W^O$, for a total of $4 d_{model} d_{head} n_{heads}$ parameters per layer.
Total parameters: $4xyzw = 4 d_{model} d_{head} n_{heads} n_{layers}$
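Evaluating this for GPT-3 175B, following the count above (which ignores any bias terms on the projections):

```python
n_layers, d_model, n_heads, d_head = 96, 12288, 96, 128      # x, y, z, w for GPT-3 175B

per_head = 3 * d_model * d_head                              # W^Q_i, W^K_i, W^V_i for one head
per_layer = per_head * n_heads + n_heads * d_head * d_model  # all heads plus W^O
attention_params = per_layer * n_layers

print(f"{per_layer:,}")          # 603,979,776 parameters per layer
print(f"{attention_params:,}")   # 57,982,058,496 -- roughly 58B of the 175B
```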
Transformer Parameters - FFN
The “feed-forward network” (FFN) is the other half of the basic transformer block (Exhibit B). Exhibit D shows that it consists of a linear transform parameterized by $W_1$ and $b_1$, an activation function, and then another linear transform parameterized by $W_2$ and $b_2$, as one m...