Qwen2.5 is a comprehensive series of large language models (LLMs) designed to handle a diverse range of tasks, featuring significant enhancements over its predecessor, Qwen2. The series offers both open-weight dense models (ranging from 0.5B to 72B parameters) and proprietary Mixture-of-Experts (MoE) models (Qwen2.5-Turbo and Qwen2.5-Plus).
The key advancements of the Qwen2.5 series include:
- Massive Pre-training Data: The models were pre-trained on a scaled-up dataset of 18 trillion tokens (compared to 7 trillion for Qwen2). The team improved data filtering and heavily incorporated high-quality math, coding, and synthetic data to build a strong foundation for expert knowledge and reasoning.
- Advanced Post-training: Qwen2.5 underwent extensive post-training using over 1 million supervised fine-tuning (SFT) samples and a two-stage reinforcement learning approach (offline DPO followed by online GRPO). This significantly improved its instruction following, long text generation, structured data analysis, and alignment with human preferences.
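To make the two RL stages above concrete, here is a minimal, self-contained sketch of the core objectives they are based on: the DPO preference loss (offline stage) and GRPO's group-normalized advantage (online stage). This is an illustration of the general techniques, not Qwen2.5's actual training code; in practice the log-probabilities are sequence-level values from the policy and a frozen reference model, and the `beta` value here is a hypothetical choice.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    """
    # Margin between the policy's and the reference's log-prob gaps.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): small when the policy prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def grpo_advantages(rewards):
    """GRPO-style advantages: normalize each reward against its sampled group,
    avoiding a separate learned value function."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

With a zero margin the DPO loss reduces to `log 2`, and a group of rewards `[1.0, 0.0]` yields advantages of roughly `+1` and `-1`, which is the behavior the normalization is designed to produce.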
- Expanded Context Window: The models feature major upgrades in context processing. The standard models support context lengths of up to 128K tokens, while Qwen2.5-Turbo supports up to 1 million tokens. The maximum generation length has also been increased from 2K to 8K tokens.
- State-of-the-Art Performance: Qwen2.5 demonstrates top-tier capabilities across benchmarks evaluating language understanding, mathematics, coding, and reasoning. Notably, the flagship open-weight model, Qwen2.5-72B-Instruct, performs competitively against the state-of-the-art Llama-3-405B-Instruct despite being about five times smaller. Furthermore, the proprietary MoE models, Qwen2.5-Turbo and Qwen2.5-Plus, rival GPT-4o-mini and GPT-4o, respectively, while offering superior cost-effectiveness.