The provided text is a research paper introducing DeepSeek-V2, a capable open-source Mixture-of-Experts (MoE) large language model developed by DeepSeek-AI. The paper details the model's architectural innovations, training process, and evaluation results, highlighting its ability to deliver top-tier performance while keeping training economical and inference efficient.
Here is a short summary of the key points from the paper:
Model Scale and Capacity
- DeepSeek-V2 features 236 billion total parameters, but relies on sparse computation so that only 21 billion parameters are activated per token.
- It supports a context window of up to 128K tokens.
Core Architectural Innovations
The model achieves its efficiency through two primary architectural upgrades to the standard Transformer framework:
- Multi-Head Latent Attention (MLA): To address the heavy memory bottleneck caused by Key-Value (KV) caching during inference, MLA compresses the KV cache into a small latent vector. This innovation reduces the KV cache by 93.3% compared to their previous dense model (DeepSeek 67B) and boosts the maximum generation throughput to 5.76 times that of DeepSeek 67B.
- DeepSeekMoE: The model's Feed-Forward Networks utilize a specialized MoE architecture that features fine-grained expert segmentation and shared expert isolation. This allows for a much more economical training process, saving 42.5% in training costs compared to DeepSeek 67B.
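The core idea behind MLA, caching one small latent vector per token instead of full per-head keys and values, can be sketched in a few lines. This is a minimal illustration, not DeepSeek-V2's actual implementation: all dimensions and weight matrices below are made-up placeholders, and details such as RoPE handling are omitted.

```python
import numpy as np

# Sketch of MLA-style low-rank KV compression (illustrative sizes only):
# cache a small latent per token, reconstruct K/V from it at attention time.
rng = np.random.default_rng(0)

d_model = 1024   # hidden size (hypothetical)
n_heads = 8
d_head = 128     # a naive cache stores 2 * n_heads * d_head floats per token
d_latent = 64    # MLA caches only this many floats per token

# Learned projections in the real model; random here for illustration.
W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # compress
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # latent -> keys
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # latent -> values

h = rng.standard_normal((16, d_model))  # hidden states of 16 cached tokens

latent = h @ W_down   # (16, d_latent): this is all that gets cached
k = latent @ W_uk     # keys reconstructed on the fly
v = latent @ W_uv     # values reconstructed on the fly

standard_cache = 16 * 2 * n_heads * d_head  # floats per layer, naive KV cache
mla_cache = 16 * d_latent                   # floats per layer, MLA
print(f"cache reduction: {1 - mla_cache / standard_cache:.1%}")
```

The memory saving comes entirely from `d_latent` being much smaller than `2 * n_heads * d_head`; the extra up-projections trade a little compute for a much smaller cache.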
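The DeepSeekMoE idea of fine-grained routed experts plus always-on shared experts can likewise be sketched. This is a toy single-token version under assumed sizes; the real model's expert counts, gating details, and load-balancing losses are not reproduced here.

```python
import numpy as np

# Toy sketch of shared + routed experts (all sizes hypothetical).
rng = np.random.default_rng(1)

d = 64         # hidden size
n_routed = 16  # fine-grained routed experts
n_shared = 2   # shared experts, applied to every token
top_k = 4      # routed experts activated per token

def expert(x, w1, w2):
    # A tiny two-layer FFN expert: linear -> ReLU -> linear.
    return np.maximum(x @ w1, 0) @ w2

routed = [(rng.standard_normal((d, d)) * 0.05, rng.standard_normal((d, d)) * 0.05)
          for _ in range(n_routed)]
shared = [(rng.standard_normal((d, d)) * 0.05, rng.standard_normal((d, d)) * 0.05)
          for _ in range(n_shared)]
W_gate = rng.standard_normal((d, n_routed)) * 0.05

def moe_layer(x):
    # Shared experts: every token passes through all of them.
    out = sum(expert(x, w1, w2) for w1, w2 in shared)
    # Routed experts: gate scores pick top_k experts for this token.
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    for g, i in zip(gates, top):
        out = out + g * expert(x, *routed[i])
    return out

y = moe_layer(rng.standard_normal(d))
```

Only `n_shared + top_k` of the `n_shared + n_routed` experts run per token, which is the sparse-activation property that lets total parameters grow far faster than per-token compute.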
Training Pipeline
- Pre-Training: The model is initially trained on a high-quality, bilingual (English and Chinese) corpus consisting of 8.1 trillion tokens.
- Alignment: Following pre-training, the model undergoes Supervised Fine-Tuning (SFT) using 1.5 million conversational sessions, followed by Reinforcement Learning (RL) using Group Relative Policy Optimization (GRPO) to further align the model with human preferences, reasoning, and safety.
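The distinguishing step of GRPO is that it replaces a learned value critic with group-relative reward normalization: several responses are sampled per prompt, and each response's advantage is its reward standardized within that group. A minimal sketch of that advantage computation, with made-up reward values:

```python
import numpy as np

# Hypothetical rewards assigned by a reward model to one prompt's
# group of sampled responses (values are illustrative).
rewards = np.array([0.2, 0.9, 0.5, 0.1])

# Group-relative advantage: standardize rewards within the group,
# so no separate value network is needed as a baseline.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.round(3))
```

These per-response advantages then weight a clipped PPO-style policy-gradient update; the clipping and KL terms of the full objective are omitted here.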
Performance and Results
Despite activating only 21 billion parameters per token, DeepSeek-V2 establishes itself as a top-tier open-source model. In comprehensive benchmark evaluations, it rivals or outperforms other leading open-source models such as Qwen1.5 72B, Mixtral 8x22B, and LLaMA 3 70B, with especially strong advantages in Chinese language comprehension, mathematics, and coding tasks.
In short, the paper demonstrates that through careful architectural design (MLA and DeepSeekMoE), it is possible to scale up a model's parameter count and capability while significantly reducing the computational overhead typically required for training and serving large language models.