Daily Tech Feed: From the Labs

The Megatron Problem — Show Notes

DTF:FTL Episode 0030 | March 12, 2026

Every competitive frontier model going forward is sparse. Mixture-of-Experts architectures decouple parameter count from per-token compute — but training them at scale creates coupled constraints across memory, communication, and computation that dense models never had. NVIDIA's Megatron Core team published the full engineering receipt: 88 pages, 42 figures, production-tested on clusters of thousands of GPUs.
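That decoupling is the whole trick: all expert parameters are stored, but each token only pays compute for its top-k experts. Here is a toy NumPy sketch of top-k routing with made-up dimensions; it illustrates the idea, not Megatron Core's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tokens = 16, 8, 2, 4  # toy sizes, not the paper's

# Each expert is a small 2-layer FFN; stored parameters grow with n_experts.
experts = [(rng.standard_normal((d_model, 4 * d_model)) * 0.02,
            rng.standard_normal((4 * d_model, d_model)) * 0.02)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02
x = rng.standard_normal((n_tokens, d_model))

# Router: softmax over experts for each token.
logits = x @ router
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

out = np.zeros_like(x)
for t in range(n_tokens):
    top = np.argsort(probs[t])[-top_k:]      # top-k experts for this token
    for e in top:
        w1, w2 = experts[e]
        h = np.maximum(x[t] @ w1, 0.0)       # ReLU FFN for expert e
        out[t] += probs[t, e] * (h @ w2)     # gate-weighted combination

total_params = sum(w1.size + w2.size for w1, w2 in experts)
active_params = top_k * (experts[0][0].size + experts[0][1].size)
print(total_params, active_params)  # 16384 stored vs 4096 active per token
```

With 8 experts and top-2 routing, a token touches only a quarter of the expert parameters; at frontier scale the ratio is far more extreme, which is exactly why routing, dispatch, and expert placement become the hard systems problems the paper addresses.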

Why it matters.

MoE is not a research curiosity. DeepSeek-V3, Qwen3, Mixtral, and most frontier models in active development are sparse. The question was never whether MoE architectures were theoretically superior — the question was whether anyone could actually train them efficiently at scale. This paper answers that question with production numbers: 1,233 TFLOPS per GPU on GB300 for a 685-billion-parameter model, roughly 50 percent of theoretical hardware peak. The framework is open source. Any serious lab can now train competitive sparse models. The moat just got narrower.
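To put the throughput figure in perspective, the standard ~6N FLOPs-per-token estimate (forward plus backward) converts per-GPU FLOPS into token throughput. The 37-billion activated-parameter count below is an assumption borrowed from DeepSeek-V3's reported configuration, not a number from these notes.

```python
# Back-of-envelope training throughput from the episode's quoted numbers.
achieved = 1233e12            # FLOPS per GPU, as quoted for GB300
active_params = 37e9          # ASSUMED activated params/token (DeepSeek-V3-style)
flops_per_token = 6 * active_params          # standard fwd+bwd estimate
tokens_per_sec_per_gpu = achieved / flops_per_token
print(round(tokens_per_sec_per_gpu))         # roughly 5554 tokens/s per GPU
```

The point of the arithmetic: sparsity means the FLOPs bill scales with activated parameters, so a 685B-parameter model trains at a per-token cost closer to a dense model a fraction of its size.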

Primary Source
  • Paper: Scalable Training of Mixture-of-Experts Models with Megatron Core — https://arxiv.org/abs/2603.07685
  • Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
  • Megatron-Core (within Megatron-LM): https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core

Models Referenced
  • DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
  • DeepSeek-V3 GitHub: https://github.com/deepseek-ai/DeepSeek-V3
  • Qwen3 Technical Report / Blog: https://qwenlm.github.io/blog/qwen3/
  • Qwen3 GitHub: https://github.com/QwenLM/Qwen3
  • Mixtral of Experts (Mistral AI): https://arxiv.org/abs/2401.04088

MoE Foundations
  • Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
  • Switch Transformer (Google, 2021): https://arxiv.org/abs/2101.03961
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Google): https://arxiv.org/abs/2112.06905
  • Expert Choice Routing (Zhou et al., 2022): https://arxiv.org/abs/2202.09368

Parallelism and Training Infrastructure
  • Megatron-LM: Training Multi-Billion Parameter Language Models (original paper): https://arxiv.org/abs/1909.08053
  • Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021): https://arxiv.org/abs/2104.04473
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: https://arxiv.org/abs/2006.16668
  • FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models: https://arxiv.org/abs/2201.12023

Compute Primitives
  • FlashAttention-2 (Dao, 2023): https://arxiv.org/abs/2307.08691
  • Grouped GEMM (CUTLASS): https://github.com/NVIDIA/cutlass
  • NVIDIA CUDA Graphs documentation: https://developer.nvidia.com/blog/cuda-graphs/

Hardware
  • NVIDIA GB200 NVL72 architecture overview: https://www.nvidia.com/en-us/data-center/gb200-nvl72/
  • NVIDIA Blackwell GPU architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

Related Reading
  • Scaling Laws for Neural Language Models (Kaplan et al., 2020): https://arxiv.org/abs/2001.08361
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: https://arxiv.org/abs/2201.05596

DTF:FTL — Dispatches from the edge. New episodes daily. All content AI-assisted; factual claims sourced from cited papers.
