Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Learned Optimization in Small Transformer Models, published by Jonathan Bostock on April 8, 2024 on The AI Alignment Forum.
This is original, independent research carried out in March and April of 2024.
The degree to which a policy optimizes the future can be quantified mathematically. A set of very small transformer models was pretrained to predict the next token in a mathematical sequence, then subjected to reinforcement learning finetuning.
The optimizing power of each model can be predicted with high accuracy from its score on its own RL task. By comparing the optimization predicted from scores on each of the different RL tasks, a model's original reinforcement objective can be identified.
A related measure for impact can also be derived mathematically, and given a theoretical lower bound based on RL score. This gives further information about model behavior, and allows for the same analysis as the measure of optimization.
I also investigate the possibility of getting models to self-evaluate optimization and impact, with limited success.
Methods
Pretraining on Sequence Prediction
I defined a simple mathematical sequence via the following stochastic recurrence relation. This produces a pseudo-random but (to 98%) predictable sequence, alternating between elements of {0,...,7} on even values of t and {8,...,15} on odd values of t.
$$s_t = \begin{cases} \left(\left(\prod_{i=1}^{16}(s_{t-i}+1)\right) \bmod 17\right) \bmod 8 & \text{with probability } 98\% \\ \text{uniform over } \{0,\dots,7\} & \text{with probability } 2\% \end{cases} \;+\; 8\,(t \bmod 2)$$
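As a concrete illustration, here is a minimal Python sketch of a generator for this sequence. It is not necessarily the exact implementation used for these experiments: the product-over-the-previous-16-elements form follows my reading of the recurrence above, and the function names are just placeholders.

```python
import random

def next_element(history, t):
    """Generate s_t from the previous 16 elements (one reading of the recurrence above)."""
    if random.random() < 0.98:
        # deterministic branch: product of (s_{t-i} + 1) for i = 1..16, mod 17, then mod 8
        prod = 1
        for s in history[-16:]:
            prod = (prod * (s + 1)) % 17
        base = prod % 8
    else:
        # 2% of the time, substitute a uniform element of {0,...,7}
        base = random.randrange(8)
    # the +8*(t % 2) term puts even-t elements in {0,...,7} and odd-t elements in {8,...,15}
    return base + 8 * (t % 2)

def generate_sequence(length, seed_history=None):
    """Roll out a sequence of the given length from a (random) 16-element seed."""
    seq = list(seed_history) if seed_history else [random.randrange(8) + 8 * (t % 2) for t in range(16)]
    for t in range(len(seq), length):
        seq.append(next_element(seq, t))
    return seq
```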
I then trained a small encoder-only transformer model to predict the next element in the sequence given the previous 20 elements of the sequence.
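For concreteness, here is a minimal sketch of this pretraining setup, reusing the generate_sequence helper above. The architecture sizes, learning rate, and single illustrative training step are assumptions rather than the exact configuration used; only the 20-element context, the 16-token vocabulary, and the AdamW/dropout regularization mentioned later are taken from the text.

```python
import torch
import torch.nn as nn

CONTEXT = 20   # previous elements shown to the model
VOCAB = 16     # tokens 0..15

class TinySeqModel(nn.Module):
    """A very small encoder-only transformer that predicts the next element (illustrative sizes)."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Parameter(torch.zeros(CONTEXT, d_model))  # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, x):                        # x: (batch, CONTEXT) of ints
        h = self.encoder(self.embed(x) + self.pos)
        return self.head(h[:, -1, :])            # logits for the next element

def make_examples(seq):
    """Slice a sequence into (20-element context, next element) training pairs."""
    xs = [seq[i:i + CONTEXT] for i in range(len(seq) - CONTEXT)]
    ys = [seq[i + CONTEXT] for i in range(len(seq) - CONTEXT)]
    return torch.tensor(xs), torch.tensor(ys)

# one ordinary next-token cross-entropy training step
model = TinySeqModel()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
x, y = make_examples(generate_sequence(2000))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward(); opt.step(); opt.zero_grad()
```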
This was followed by a reinforcement-learning phase in which the transformer was used to generate the next token on odd values of $t$ only, and the recurrence relation was used to generate the value of $s_{t+1}$. If $s_{t+1}$ was in $\{0,2,4,6\}$, this was used as a "successful" example to reinforce the model. I used a temperature of 1 when generating these sequences to introduce some randomness, but the temperature was reduced to 0 during evaluations and when calculating optimization.
A small amount of "maintenance" training (at a much lower learning rate) was used during this phase to ensure that model performance on the predictive task for even values of $t$ was maintained. Without this I saw rapid loss of performance on the "maintenance" dataset. I also found that I was unable to include "unsuccessful" examples (i.e. where $s_{t+1} \notin \{0,2,4,6\}$) with even a tiny negative learning rate, as this worsened performance at all tasks.
Here is a typical set of results from training and evaluation:
I carried out this training on N=5 models per size for four model sizes between 18k and 402k parameters, giving the following plot:
Pretraining loss increases over the last few model sizes, and the loss/time plots (some of which I have put in the Supplementary Information at the bottom of this post) showed signs of overfitting in the larger models. Regularization was employed during training (0.01 weight decay in an AdamW optimizer, 10% neuron dropout), so perhaps a larger dataset is required to avoid this entirely.
I then repeated the RL phase twice: once with $s_{t+1} \in \{0,4\}$ being reinforced ($n_\text{good} = 2$), and once with $s_{t+1} \in \{0,1,2,4,5,6\}$ being reinforced ($n_\text{good} = 6$). Here is a plot of success rate against model size across all three conditions.
This plot shows mean ± standard error. In all cases model performance is a lot better than chance, and increases with model size.
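In terms of the finetuning sketch above, the three conditions differ only in which values of $s_{t+1}$ count as successes. A small illustrative mapping, where $n_\text{good}$ is simply the size of the reinforced set:

```python
# the three RL conditions, keyed by n_good; each set would be passed as
# good_set to the finetuning sketch above
REINFORCED_SETS = {
    2: {0, 4},
    4: {0, 2, 4, 6},
    6: {0, 1, 2, 4, 5, 6},
}
```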
Measuring Optimization
I used a Monte Carlo simulation to measure the nats of optimization being applied to $s_{t+1}$, using the split-history method I've previously outlined. This involves taking the difference in entropy between two distributions.
The algorithm in practice is this:
Take a bunch of sequence examples from the testing...
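As a hedged sketch of the underlying Monte Carlo computation (not the full split-history procedure), the optimization applied to $s_{t+1}$ can be estimated as the difference in empirical entropy between rollouts with and without the trained policy choosing the odd-$t$ token. The sampler callables below are hypothetical stand-ins for those two conditions.

```python
import math
from collections import Counter

def entropy_nats(samples):
    """Empirical Shannon entropy (in nats) of a list of discrete samples."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def optimization_nats(sample_with_policy, sample_without_policy, n=10_000):
    """Monte Carlo estimate of the entropy reduction the policy applies to s_{t+1}.

    Both arguments are callables returning one sample of s_{t+1}: one with the
    model (at temperature 0) choosing the odd-t token, one with that token
    drawn from the base sequence process instead.
    """
    with_policy = [sample_with_policy() for _ in range(n)]
    without_policy = [sample_without_policy() for _ in range(n)]
    return entropy_nats(without_policy) - entropy_nats(with_policy)
```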