The Nonlinear Library

LW - Llama We Doing This Again? by Zvi


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Llama We Doing This Again?, published by Zvi on July 26, 2023 on LessWrong.
I've finally had an opportunity to gather the available information about Llama-2 and take an in-depth look at the system card.
My conclusion is that Llama-2 looks to score about 3.4 GPTs, with coding as its relative weak point. The system card tries to claim better performance than that in some places in rather misleading fashion, but in other places it does not make such claims.
For its intended purposes it is now the best open source model, while remaining well behind closed source models. There is substantial improvement over Llama-1 in capabilities, it comes with fine tuning, and also with an attempt at harmlessness.
That attempt at harmlessness appears even more ham-fisted than usual. The claims of a 0.05% (!?!) false refusal rate are clearly very false. Early public red teaming quickly revealed a number of problems, in a model that cannot be unreleased or fully patched.
Llama We Doing This Again?
Meta notices world not yet destroyed and people remain alive, so it has not open sourced enough models. Hence it released Llama 2. Here's the paper, here's the blog announcement, here is a download link to GitHub. Here's Llama-70B on Replicate.
Simon Willison (re: Replicate): Here's how to use it with LLM:
llm replicate add \
  replicate/llama70b-v2-chat \
  --chat --alias llama70b
Then: llm -m llama70b "Invent an absurd ice cream sundae"
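If you would rather call the same Replicate-hosted model from Python than through the llm CLI, a minimal sketch is below. It assumes the replicate package is installed and a REPLICATE_API_TOKEN environment variable is set; the model slug is the one referenced above, and Replicate may require pinning a specific version hash.
import replicate

# Minimal sketch: call the Replicate-hosted Llama-2 70B chat model from Python.
# Assumes `pip install replicate` and a REPLICATE_API_TOKEN environment variable.
# The model slug mirrors the one above; pin a version hash if Replicate requires it.
output = replicate.run(
    "replicate/llama70b-v2-chat",
    input={"prompt": "Invent an absurd ice cream sundae"},
)
# Chat models on Replicate stream output, so the result is an iterator of text chunks.
print("".join(output))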
Here's Jim Fan's video guide to fine-tuning using Replicate. Here is Replicate's official guide.
Here are alternative instructions and a script for training Llama-2 on your own data. Doing this with the 7B model can be done on a T4 GPU; for the 70B you'll need an A100. (A rough sketch of what such a script involves follows after these links.)
Here's an alternative set of instructions and a cookbook from ScaleAI.
Here's a link to chat with Llama-2 via Perplexity.
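To give a concrete sense of what "training Llama-2 on your own data" on a single T4 looks like, here is a minimal, hedged sketch using Hugging Face transformers, peft, and bitsandbytes for 4-bit LoRA fine-tuning of the 7B model. It is not the script from any of the guides above: the file name your_data.jsonl and all hyperparameters are illustrative, and the meta-llama/Llama-2-7b-hf weights are gated behind Meta's license.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # gated: requires accepting Meta's license on Hugging Face

# Load the base model in 4-bit so the 7B fits on a single 16 GB T4.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Train small LoRA adapters instead of the full weights.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# your_data.jsonl is a placeholder: one {"text": "..."} example per line.
data = load_dataset("json", data_files="your_data.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), batched=True)

args = TrainingArguments(output_dir="llama2-7b-lora", per_device_train_batch_size=1,
                         gradient_accumulation_steps=8, num_train_epochs=1,
                         learning_rate=2e-4, fp16=True, logging_steps=10)
Trainer(model=model, args=args, train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()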
I'll go through the paper. It spells out how Llama-2 was trained, listing all sorts of parameters. Almost all of them seem standard, but knowing is valuable.
The System Card
Llama 2 has double the context length of Llama 1, and was trained on 40% more data.
They have a chart claiming Llama-2 outperforms MPT and Falcon on various benchmarks.
They claim that GPT-4 thinks Llama-2 outperforms GPT-3.5.
The next observation is that Llama-2, if you use their own metrics, plays it in a way I would characterize as 'too safe.'
ChatGPT's rate of violations here is about 7%. Reducing that to 4%, as Llama-2 is claiming, implies an extreme level of caution, or it implies they have greatly surpassed OpenAI's ability to defend against adversarial examples. I know which way I would bet.
The 7B, 13B and 70B models have been released for commercial use.
A strange note is that they did not train using data from Meta's services. They had one big advantage, and they did not even use it? This seems to be due to a desire to avoid sources with a lot of private information. If that is the concern, then given the nature of Meta's available data, they have a big problem.
Their report on training techniques might as well say 'we used standard techniques.' The RLHF is 'we asked people which of two responses was better.' There are some numbers listed and I assume they are completely standard. The biggest change was a small amount of 'high quality' fine-tuning data.
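For readers unfamiliar with how 'which of two responses was better' becomes a training signal: the paper trains a reward model with a pairwise ranking loss, pushing the reward of the preferred response above the rejected one, with a margin term tied to how strong the annotator's preference was. A minimal PyTorch sketch of that loss, with illustrative variable names:
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards, rejected_rewards, margin=0.0):
    # chosen_rewards, rejected_rewards: (batch,) scalar reward-model outputs
    # margin: optional per-pair margin reflecting how strong the stated preference was
    return -F.logsigmoid(chosen_rewards - rejected_rewards - margin).mean()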
How are its capabilities? Here is a comparison. Note which models they chose to compare themselves to here, and which ones they did not. At a given size, this seems like a modest improvement over MPT and Falcon.
This table covers the real comparisons. Why use different benchmarks, I wonder?
This tells us that Llama-2 is potentially similar to PaLM-1, with two of its three scores similar to GPT-3.5. Then later they show this:
We report the results in terms of accuracy in Table 7. As expected, our own reward models perform the best on our internal test sets collected...