The Megatron Problem — Show Notes
DTF:FTL Episode 0030 | March 12, 2026
Every competitive frontier model going forward is sparse. Mixture-of-Experts architectures decouple parameter count from per-token compute — but training them at scale creates coupled constraints across memory, communication, and computation that dense models never had. NVIDIA's Megatron Core team published the full engineering receipt: 88 pages, 42 figures, production-tested on clusters of thousands of GPUs.
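The decoupling claim is simple arithmetic: a router sends each token to only a few experts, so parameters touched per token stay small while total parameters grow with the expert count. A minimal sketch, with purely illustrative numbers (the expert size and shared-parameter split below are hypothetical, not the paper's actual breakdown):

```python
def moe_param_counts(n_experts, top_k, params_per_expert, shared_params):
    """Total parameters vs. parameters touched per token in a sparse MoE model.

    shared_params covers everything every token uses (attention, embeddings,
    any dense layers); routed experts contribute to total unconditionally but
    to per-token compute only top_k at a time.
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical config: 256 routed experts, top-8 routing,
# 2.5B params per expert, 45B shared parameters.
total, active = moe_param_counts(256, 8, 2.5e9, 45e9)
print(f"total: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
```

Scaling `n_experts` grows `total` linearly while leaving `active` (and hence per-token FLOPs) untouched, which is exactly why sparse models can reach hundreds of billions of parameters without a matching compute bill.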
Why It Matters
MoE is not a research curiosity. DeepSeek-V3, Qwen3, Mixtral, and most frontier models in active development are sparse. The question was never whether MoE architectures were theoretically superior — the question was whether anyone could actually train them efficiently at scale. This paper answers that question with production numbers: 1,233 TFLOPS per GPU on GB300 for a 685-billion-parameter model, roughly 50 percent of theoretical hardware peak. The framework is open source. Any serious lab can now train competitive sparse models. The moat just got narrower.
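A quick sanity check on the headline number: 1,233 TFLOPS described as roughly 50 percent of peak implies a theoretical peak near 2,466 TFLOPS. The peak figure below is back-derived from those two quoted numbers, not taken from a GB300 datasheet (the real peak depends on precision and hardware revision):

```python
# Reported sustained throughput from the paper.
achieved_tflops = 1233.0

# Assumed peak, back-derived from "roughly 50 percent of theoretical
# hardware peak" — hypothetical, not an official GB300 spec.
assumed_peak_tflops = 2466.0

utilization = achieved_tflops / assumed_peak_tflops
print(f"hardware utilization: {utilization:.1%}")
```

Sustaining ~50 percent of peak on a sparse model at cluster scale is the notable part; dense-model training pipelines have historically been the ones hitting that range.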
Primary Source
Paper: Scalable Training of Mixture-of-Experts Models with Megatron Core — https://arxiv.org/abs/2603.07685
Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
Megatron-Core (within Megatron-LM): https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core
Models Referenced
DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
DeepSeek-V3 GitHub: https://github.com/deepseek-ai/DeepSeek-V3
Qwen3 Technical Report / Blog: https://qwenlm.github.io/blog/qwen3/
Qwen GitHub: https://github.com/QwenLM/Qwen3
Mixtral of Experts (MoE paper, Mistral AI): https://arxiv.org/abs/2401.04088
MoE Foundations
Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
Switch Transformer (Google, 2021): https://arxiv.org/abs/2101.03961
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Google): https://arxiv.org/abs/2112.06905
Expert Choice Routing (Zhou et al., 2022): https://arxiv.org/abs/2202.09368
Parallelism and Training Infrastructure
Megatron-LM: Training Multi-Billion Parameter Language Models (original paper): https://arxiv.org/abs/1909.08053
Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021): https://arxiv.org/abs/2104.04473
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: https://arxiv.org/abs/2006.16668
FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models: https://arxiv.org/abs/2201.12023
Compute Primitives
FlashAttention-2 (Dao, 2023): https://arxiv.org/abs/2307.08691
Grouped GEMM (CUTLASS): https://github.com/NVIDIA/cutlass
NVIDIA CUDA Graphs documentation: https://developer.nvidia.com/blog/cuda-graphs/
Hardware
NVIDIA GB200 NVL72 architecture overview: https://www.nvidia.com/en-us/data-center/gb200-nvl72/
NVIDIA Blackwell GPU architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
Related Reading
Scaling Laws for Neural Language Models (Kaplan et al.): https://arxiv.org/abs/2001.08361
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: https://arxiv.org/abs/2201.05596
DTF:FTL — Dispatches from the edge. New episodes daily. All content AI-assisted; factual claims sourced from cited papers.