Neural Intel Podcast Episode
MoE Giants: Decoding the 670 Billion Parameter Showdown Between DeepSeek V3 and Mistral Large
This week on Neural Intel, we dive deep into the architectural blueprints of two colossal Mixture-of-Experts (MoE) models: DeepSeek V3 (671B total parameters) and Mistral 3 Large (675B total parameters). We explore the configurations that define these massive language models, noting their shared traits: an embedding dimension of 7,168 and a vocabulary size of 129K. Both architectures employ a SwiGLU FeedForward module, and in both, the first three transformer blocks use a dense FFN with a hidden size of 18,432 in place of the MoE layer.
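To make that shared building block concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward module with the dense-block dimensions quoted above (embedding dimension 7,168, hidden size 18,432). The class and parameter names are illustrative and not taken from either model's actual codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down_proj(silu(gate_proj(x)) * up_proj(x))."""

    def __init__(self, d_model: int = 7168, d_hidden: int = 18432):
        super().__init__()
        # With the defaults above, this matches the dense FFN dimensions
        # described for the first three transformer blocks of both models.
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check with scaled-down sizes to keep the demo lightweight.
ffn = SwiGLUFeedForward(d_model=8, d_hidden=16)
print(ffn(torch.randn(2, 4, 8)).shape)  # torch.Size([2, 4, 8])
```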
The core of the discussion focuses on how each model utilizes its MoE layers, which in both cases contain 128 experts. We contrast the resource allocation and expert routing: DeepSeek V3/R1 is configured to activate one shared expert plus six routed experts per token, resulting in only 37B active parameters per inference step. In contrast, Mistral 3 Large activates one shared expert plus four routed experts per token, yet ends up with 39B active parameters per inference step.
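For listeners who want to see the shared-plus-routed pattern in code, below is a minimal PyTorch sketch of such an MoE layer: one always-active shared expert plus the top-k routed experts selected per token. The naive per-token loop, the softmax-then-top-k gating, and all names are illustrative simplifications, not either model's actual implementation (real routers add load balancing, normalization, and batched dispatch).

```python
import torch
import torch.nn as nn

def simple_expert(d_model: int, d_hidden: int) -> nn.Module:
    """One expert; a plain SiLU MLP stands in for the real SwiGLU block here."""
    return nn.Sequential(
        nn.Linear(d_model, d_hidden, bias=False),
        nn.SiLU(),
        nn.Linear(d_hidden, d_model, bias=False),
    )

class SharedRoutedMoE(nn.Module):
    """A shared expert that sees every token, plus top-k of n_experts routed experts."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 128, top_k: int = 6):
        super().__init__()
        self.shared_expert = simple_expert(d_model, d_hidden)
        self.experts = nn.ModuleList(simple_expert(d_model, d_hidden) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        gate_weights, expert_ids = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        outputs = []
        for t in range(x.size(0)):  # per-token loop, kept naive for readability
            routed = sum(w * self.experts[int(e)](x[t])
                         for w, e in zip(gate_weights[t], expert_ids[t]))
            outputs.append(self.shared_expert(x[t]) + routed)
        return torch.stack(outputs)

# Tiny demo (the real models use d_model=7168, 128 experts, and a top_k of 6 or 4).
moe = SharedRoutedMoE(d_model=16, d_hidden=32, n_experts=8, top_k=2)
print(moe(torch.randn(5, 16)).shape)  # torch.Size([5, 16])
```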
We also analyze other crucial architectural differences visible in their configuration files, including the per-expert intermediate hidden dimension: 2,048 for DeepSeek V3/R1 versus 4,096 for Mistral 3 Large. Join us as we dissect how these subtle parameter choices, affecting multi-head latent attention, expert distribution, and shared experts, impact overall efficiency and performance in the race to build the most capable and resource-efficient large language models.
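As a rough back-of-the-envelope check on why the wider 4,096 experts push Mistral 3 Large's active count above DeepSeek's despite fewer experts firing per token, here is a small Python calculation of active FFN parameters in a single MoE layer. It counts only the three SwiGLU projection matrices per expert and assumes the shared expert has the same width as the routed experts; attention, embeddings, routers, and the dense blocks are ignored, so these are per-layer illustrations rather than the full 37B/39B figures.

```python
def swiglu_expert_params(d_model: int, d_hidden: int) -> int:
    """Parameters of one SwiGLU expert: gate, up, and down projection matrices."""
    return 3 * d_model * d_hidden

D_MODEL = 7168  # shared embedding dimension

# DeepSeek V3/R1: 1 shared + 6 routed experts active, each 2,048 wide
deepseek_active = (1 + 6) * swiglu_expert_params(D_MODEL, 2048)

# Mistral 3 Large: 1 shared + 4 routed experts active, each 4,096 wide
mistral_active = (1 + 4) * swiglu_expert_params(D_MODEL, 4096)

print(f"DeepSeek V3/R1  active FFN params per MoE layer: {deepseek_active / 1e6:.0f}M")
print(f"Mistral 3 Large active FFN params per MoE layer: {mistral_active / 1e6:.0f}M")
# Fewer but wider active experts give Mistral 3 Large more active FFN parameters
# per layer, in line with its higher overall active-parameter count.
```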