Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: MetaAI: less is less for alignment., published by Cleo Nardo on June 13, 2023 on LessWrong.
Summary
In May 2023, MetaAI submitted a paper to arXiv called LIMA: Less Is More for Alignment. It's a pretty bad paper and (in my opinion) straightforwardly misleading. Let's get into it.
The Superficial Alignment Hypothesis
The authors present an interesting hypothesis about LLMs:
We define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.
If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.
We hypothesize that alignment can be a simple process where the model learns the style or format for interacting with users, to expose the knowledge and capabilities that were already acquired during pretraining.
(1) This hypothesis would have profound implications for AI x-risk
It suggests that we could build a safe competent oracle by pretraining an LLM on the entire internet corpus, and then finetuning the LLM on a curated dataset of safe competent responses.
It suggests that we could build an alignment researcher by pretraining an LLM on the entire internet corpus, and then finetuning the LLM on a curated dataset of alignment research.
(2) Moreover, as Ulisse Mini writes in their review of the LIMA paper,
Along with TinyStories and QLoRA I'm becoming increasingly convinced that data quality is all you need, definitely seems to be the case for finetuning, and may be the case for base-model training as well. Better scaling laws through higher-quality corpus? Also for those who haven't updated, it seems very likely that GPT-4 equivalents will be essentially free to self-host and tune within a year. Plan for this!
(3) Finally, the hypothesis would've supported many of the intuitions in the Simulators sequence by Janus, and I share these intuitions.
So I was pretty excited to read the paper! Unfortunately, the LIMA results were unimpressive upon inspection.
MetaAI's experiment
The authors finetune MetaAI's 65B parameter LLaMa language model on 1000 curated prompts and responses (mostly from StackExchange, wikiHow, and Reddit), and then compare it to five other LLMs (Alpaca 65B, DaVinci003, Bard, Claude, GPT4).
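To make "finetuning on 1000 curated prompts and responses" concrete, here is a minimal supervised-finetuning sketch using the Hugging Face transformers and datasets libraries. This is not MetaAI's actual training code: the model checkpoint, the data file name, and most hyperparameters are illustrative placeholders (the paper reports roughly 15 epochs at a ~1e-5 learning rate on the 65B model, which requires far more hardware than this single-device sketch assumes).

# Minimal supervised-finetuning sketch (illustrative, not MetaAI's training code).
# Assumes a JSONL file of {"prompt": ..., "response": ...} pairs and a causal LM
# available through Hugging Face transformers; the checkpoint name is a placeholder.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

MODEL_NAME = "huggyllama/llama-7b"   # placeholder; LIMA finetuned a 65B LLaMa
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

dataset = load_dataset("json", data_files="curated_1000.jsonl")["train"]

def tokenize(example):
    # Concatenate prompt and response into one training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lima-style-sft",
        num_train_epochs=15,            # the paper reports ~15 epochs
        per_device_train_batch_size=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()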
Method:
To compare LIMA to other models, we generate a single response for each test prompt. We then ask crowd workers to compare LIMA outputs to each of the baselines and label which one they prefer. We repeat this experiment, replacing human crowd workers with GPT-4, finding similar agreement levels.
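For readers unfamiliar with this style of evaluation, here is a rough sketch of the GPT-4-as-judge variant, assuming the openai Python client (v1+). The judging prompt and the tally logic are my own illustration, not the paper's exact setup.

# Rough sketch of pairwise preference evaluation with GPT-4 as the judge.
# The judging prompt below is illustrative, not the wording used in the paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A', 'B', or 'tie' according to GPT-4's stated preference."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "You will see a prompt and two responses.\n"
                f"Prompt: {prompt}\n\nResponse A: {response_a}\n\n"
                f"Response B: {response_b}\n\n"
                "Which response is better? Answer with exactly one of: A, B, tie."
            ),
        }],
    )
    return completion.choices[0].message.content.strip()

# Example tally over (prompt, lima_output, baseline_output) triples, counting
# "equivalent or preferred" the way the paper's headline numbers do:
# wins = sum(judge(p, a, b) in ("A", "tie") for p, a, b in test_triples)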
Results:
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback.
Conclusion:
The fact that simple fine-tuning over so few examples is enough to compete with the state of the art strongly supports the Superficial Alignment Hypothesis, as it demonstrates the power of pretraining and its relative importance over large-scale instruction tuning and reinforcement learning approaches.
Problems with their experiment
(1) Human evaluators
To compare two chatbots A and B, you could ask humans whether they prefer A's response to B's response across 300 test prompts. But this is a pretty bad proxy, because here's what users actually care about:
What's the chatbot's accuracy on benchmark tests, e.g. BigBench, MMLU?
Can the chatbot pass a law exam, or a medical exam?
Can the chatbot write Python code that actually matches the specification?
Can the chatbot perform worthwhi...