The paper presents Llama 3, a new family of foundation language models developed by Meta, comprising models with 8B and 70B parameters and a flagship 405B-parameter model. These models natively support multilinguality, coding, reasoning, and tool use, and the 405B model can process a context window of up to 128K tokens.
The development of Llama 3 focuses on optimizing three levers: data, scale, and managing complexity:
- Pre-training: The models were pre-trained on a massive corpus of 15.6 trillion tokens, substantially larger and of higher quality than the data used for Llama 2.
- Post-training: The models underwent rigorous alignment using supervised finetuning (SFT), rejection sampling, and direct preference optimization (DPO) to better follow instructions and ensure helpfulness and harmlessness.
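The DPO step mentioned above trains the policy directly on preference pairs, without a separate reward model. A minimal illustrative sketch of the per-pair DPO loss is shown below; this is a simplified scalar version for clarity, not the paper's implementation, and the function name `dpo_loss` is hypothetical:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct preference optimization (DPO) loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference
    model. beta controls how far the policy may drift from the reference.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response (relative to the reference) than it favors the rejected one.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the scaled margin; the loss shrinks as the
    # policy assigns relatively higher likelihood to the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

In practice these log-probabilities come from batched forward passes over whole responses, but the scalar form makes the objective's behavior easy to see: at zero margin the loss equals log 2, and it decreases monotonically as the policy separates chosen from rejected responses.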
Extensive empirical and human evaluations demonstrate that the flagship 405B model performs on par with leading closed-source models such as GPT-4 across a wide variety of tasks, while the 8B and 70B models deliver best-in-class performance among models of similar size.
The paper also highlights robust safety measures, including the release of Llama Guard 3 for system-level input and output safety. Finally, the authors detail ongoing, unreleased experiments integrating image, video, and speech capabilities into Llama 3 using a compositional approach, which has shown competitive results against state-of-the-art multimodal models.