Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Before smart AI, there will be many mediocre or specialized AIs, published by Lukas Finnveden on May 26, 2023 on The AI Alignment Forum.
Summary:
In the current paradigm, training is much more expensive than inference. So whenever we finish end-to-end training a language model, we can run many copies of it in parallel.
If a language model were trained with Chinchilla scaling laws on the FLOP-equivalent of a large fraction of the world’s current GPUs and TPUs, I estimate that the training budget could instead produce at least ~20 million tokens per second.
Larger models trained on more data would support more tokens per second.
Language models can also run faster than humans think: current models generate 10-100 tokens per second. It’s unclear whether future models will be slower or faster than that.
This suggests that, before AI changes the world via being broadly superior to human experts, it will change the world via providing a lot of either mediocre (by the standard of human experts) or specialized thinking.
This might make the early alignment problem easier. But the full alignment problem will come soon thereafter, in calendar-time, so this mainly matters if we can use the weaker AI to buy time or make progress on alignment.
More expensive AI means you can run more AIs with your training budget
(...assuming that we’re making them more expensive by increasing parameter-count and training data.)
We’re currently in a paradigm where:
Training isn’t very sample-efficient.
As capabilities increase, training costs grow faster than inference costs (roughly as their square).
Training is massively parallelizable.[1]
While this paradigm holds, it implies that the most capable models will be trained using massively parallelized training schemes, equivalent to running a large number of models in parallel. The larger the model, the more data it needs, and so the more copies of it have to run in parallel during training in order to finish within a reasonable time-frame.[2]
This means that, once you have trained a highly capable model, you are guaranteed to have the resources to run a huge number of copies in parallel. And the bigger and more expensive the model, the more copies of it can run in parallel on your training cluster.
Here’s a rough calculation of how many language models you can run in parallel using just your training cluster:
Let’s say you use p parameters.
Running the model for one token takes kp FLOP, for some k.
Chinchilla scaling laws say training data is proportional to parameters, implying that the model is trained for mp tokens.
For Chinchilla, m=20 tokens / parameter.
Total training costs are 3kmp^2.
The 3 is there because backpropagation is ~2x as expensive as forward propagation.
You spend N seconds training your model.
During training, you use (3kmp^2/N) FLOP/s, while inference costs kp FLOP per token. So by reallocating your training compute to inference, you can process (3kmp^2/N)/(kp) = 3mp/N tokens per second.
If you take a horizon-length framework seriously, you might expect that we’ll need more training data to handle longer-horizon tasks. Let’s introduce a parameter H that describes how many token-equivalents correspond to one data-point.
Total training costs are now 3kmHp^2.
So with the compute you used to train your models, you can process 3mpH/N token-equivalents per second.
Some example numbers (each line after the first changes one or two parameters from it; the code sketch after this list reproduces them):
For p=1e14, N=1y, H=1, m=20, the above equation says you can process 200 million token-equivalents per second, with just your training budget.
For p=1e15, N=1y, H=1, m=20, it’s ~2 billion token-equivalents/second.
For p=1e14, N=3 months, H=1 hour, m=20, it’s ~1 trillion token-equivalents/second.
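To make the arithmetic easy to check, here is a minimal Python sketch of the 3mpH/N calculation. The function name, the seconds-per-year constant, and the token-equivalents-per-hour conversion in the last example are assumptions of the sketch, not figures from the post.

```python
# Minimal sketch of the 3mpH/N calculation above. k never appears because it
# cancels: training uses ~3*k*m*H*p^2 FLOP over N seconds, while inference
# costs ~k*p FLOP per token-equivalent.

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~3.16e7

def token_equivalents_per_second(p, n_seconds, m=20, h=1):
    """Token-equivalents/second from reallocating training compute to inference."""
    return 3 * m * p * h / n_seconds

# p=1e14, N=1 year, H=1, m=20  ->  ~2e8 (200 million) per second
print(token_equivalents_per_second(p=1e14, n_seconds=SECONDS_PER_YEAR))

# p=1e15, N=1 year, H=1, m=20  ->  ~2e9 (2 billion) per second
print(token_equivalents_per_second(p=1e15, n_seconds=SECONDS_PER_YEAR))

# p=1e14, N=3 months, H=1 hour. Expressing "1 hour" in token-equivalents
# needs a conversion rate; ~1,300 token-equivalents per hour (an assumption
# of this sketch) reproduces the ~1e12 (~1 trillion) figure.
print(token_equivalents_per_second(p=1e14, n_seconds=SECONDS_PER_YEAR / 4, h=1300))
```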
In addition, there are various tricks for lowering inference costs. For example, reducing precision (whi...