The provided document is a model card introducing Anthropic's Claude 3 family of large multimodal models. The family consists of three models: Opus (the most capable), Sonnet (a balance of skills and speed), and Haiku (the fastest and least expensive).
Highlights from the model card include:
- Core Capabilities & Vision: The Claude 3 family sets new industry benchmarks in reasoning, mathematics, and coding, and shows stronger multilingual understanding. A major new feature is vision: the models accept images, charts, and diagrams as input alongside text and can analyze them. Opus achieves state-of-the-art results on standard evaluations such as GPQA, MMLU, and MMMU.
- Long Context and Recall: The models are offered with a 200,000-token context window, though they are reported to be capable of handling inputs of around 1 million tokens. In the "Needle In A Haystack" evaluation, which asks the model to retrieve a specific piece of information placed inside a long document, Claude 3 Opus demonstrated near-perfect recall, with over 99% accuracy.
- Behavioral Improvements: Anthropic focused heavily on behavioral design. The Claude 3 models demonstrate improved factual accuracy, better instruction following, and a more nuanced understanding of prompts. Notably, they are much less likely than previous generations to refuse benign prompts unnecessarily.
- Safety and Catastrophic Risk Assessments: Guided by Anthropic's Responsible Scaling Policy, the models underwent extensive automated evaluations and red-teaming for catastrophic risks, including autonomous replication, biological threats, and cyber capabilities. The evaluations found no indicators of catastrophic risk, and the models were classified at the ASL-2 risk level. The report also outlines ongoing mitigations covering Trust & Safety, bias, and discrimination.
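
To make the "Needle In A Haystack" idea from the Long Context and Recall bullet concrete, here is a minimal Python sketch of how such a recall test could be structured. It is an illustration only, not the harness used in the model card: the function names (`build_haystack`, `recall_score`, `toy_model`), the filler text, and the substring-based scoring rule are all assumptions, and `toy_model` is a stand-in for a real model call.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def recall_score(ask_model, filler_sentences, needle, question, expected, depths):
    """Return the fraction of depths at which the model's answer contains `expected`."""
    hits = 0
    for depth in depths:
        context = build_haystack(filler_sentences, needle, depth)
        answer = ask_model(context, question)
        if expected.lower() in answer.lower():
            hits += 1
    return hits / len(depths)


def toy_model(context, question):
    """Hypothetical stand-in for a model call: returns the sentence mentioning 'magic number'."""
    for sentence in context.split(". "):
        if "magic number" in sentence:
            return sentence
    return "I don't know."


# Assumed toy data: 200 filler sentences and one needle sentence.
filler = [f"Filler sentence {i}." for i in range(200)]
needle = "The magic number is 7481."
depths = [0.0, 0.25, 0.5, 0.75, 1.0]

score = recall_score(toy_model, filler, needle, "What is the magic number?", "7481", depths)
print(f"recall across depths: {score:.2f}")
```

In a real evaluation, `toy_model` would be replaced by a call to the model under test, and the test would be repeated across many context lengths as well as insertion depths. The scoring rule here (substring match) is the simplest possible choice and may differ from the one the model card uses.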