Daily Tech Feed: From the Labs

The Megatron Problem — Show Notes

DTF:FTL Episode 0030 | March 12, 2026

Every competitive frontier model going forward is sparse. Mixture-of-Experts architectures decouple parameter count from per-token compute — but training them at scale creates coupled constraints across memory, communication, and computation that dense models never had. NVIDIA's Megatron Core team published the full engineering receipt: 88 pages, 42 figures, production-tested on clusters of thousands of GPUs.
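That decoupling is the whole trick: all expert parameters are stored, but each token only pays compute for its top-k experts. Here is a toy NumPy sketch of top-k routing with made-up dimensions; it illustrates the idea, not Megatron Core's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tokens = 16, 8, 2, 4  # toy sizes, not the paper's

# Each expert is a small 2-layer FFN; stored parameters grow with n_experts.
experts = [(rng.standard_normal((d_model, 4 * d_model)) * 0.02,
            rng.standard_normal((4 * d_model, d_model)) * 0.02)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02
x = rng.standard_normal((n_tokens, d_model))

# Router: softmax over experts for each token.
logits = x @ router
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

out = np.zeros_like(x)
for t in range(n_tokens):
    top = np.argsort(probs[t])[-top_k:]      # top-k experts for this token
    for e in top:
        w1, w2 = experts[e]
        h = np.maximum(x[t] @ w1, 0.0)       # ReLU FFN for expert e
        out[t] += probs[t, e] * (h @ w2)     # gate-weighted combination

total_params = sum(w1.size + w2.size for w1, w2 in experts)
active_params = top_k * (experts[0][0].size + experts[0][1].size)
print(total_params, active_params)  # 16384 stored vs 4096 active per token
```

With 8 experts and top-2 routing, a token touches only a quarter of the expert parameters; at frontier scale the ratio is far more extreme, which is exactly why routing, dispatch, and expert placement become the hard systems problems the paper addresses.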

Why it matters.

MoE is not a research curiosity. DeepSeek-V3, Qwen3, Mixtral, and most frontier models in active development are sparse. The question was never whether MoE architectures were theoretically superior — the question was whether anyone could actually train them efficiently at scale. This paper answers that question with production numbers: 1,233 TFLOPS per GPU on GB300 for a 685-billion-parameter model, roughly 50 percent of theoretical hardware peak. The framework is open source. Any serious lab can now train competitive sparse models. The moat just got narrower.
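To put the throughput figure in perspective, the standard ~6N FLOPs-per-token estimate (forward plus backward) converts per-GPU FLOPS into token throughput. The 37-billion activated-parameter count below is an assumption borrowed from DeepSeek-V3's reported configuration, not a number from these notes.

```python
# Back-of-envelope training throughput from the episode's quoted numbers.
achieved = 1233e12            # FLOPS per GPU, as quoted for GB300
active_params = 37e9          # ASSUMED activated params/token (DeepSeek-V3-style)
flops_per_token = 6 * active_params          # standard fwd+bwd estimate
tokens_per_sec_per_gpu = achieved / flops_per_token
print(round(tokens_per_sec_per_gpu))         # roughly 5554 tokens/s per GPU
```

The point of the arithmetic: sparsity means the FLOPs bill scales with activated parameters, so a 685B-parameter model trains at a per-token cost closer to a dense model a fraction of its size.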

Primary Source
  • Paper: Scalable Training of Mixture-of-Experts Models with Megatron Core — https://arxiv.org/abs/2603.07685
  • Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
  • Megatron-Core (within Megatron-LM): https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core

Models Referenced
  • DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
  • DeepSeek-V3 GitHub: https://github.com/deepseek-ai/DeepSeek-V3
  • Qwen3 Technical Report / Blog: https://qwenlm.github.io/blog/qwen3/
  • Qwen3 GitHub: https://github.com/QwenLM/Qwen3
  • Mixtral of Experts (Mistral AI): https://arxiv.org/abs/2401.04088

MoE Foundations
  • Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
  • Switch Transformer (Google, 2021): https://arxiv.org/abs/2101.03961
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Google): https://arxiv.org/abs/2112.06905
  • Expert Choice Routing (Zhou et al., 2022): https://arxiv.org/abs/2202.09368

Parallelism and Training Infrastructure
  • Megatron-LM: Training Multi-Billion Parameter Language Models (original paper): https://arxiv.org/abs/1909.08053
  • Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021): https://arxiv.org/abs/2104.04473
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: https://arxiv.org/abs/2006.16668
  • FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models: https://arxiv.org/abs/2201.12023

Compute Primitives
  • FlashAttention-2 (Dao, 2023): https://arxiv.org/abs/2307.08691
  • Grouped GEMM (CUTLASS): https://github.com/NVIDIA/cutlass
  • NVIDIA CUDA Graphs documentation: https://developer.nvidia.com/blog/cuda-graphs/

Hardware
  • NVIDIA GB200 NVL72 architecture overview: https://www.nvidia.com/en-us/data-center/gb200-nvl72/
  • NVIDIA Blackwell GPU architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

Related Reading
  • Scaling Laws for Neural Language Models (Kaplan et al., 2020): https://arxiv.org/abs/2001.08361
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: https://arxiv.org/abs/2201.05596

DTF:FTL — Dispatches from the edge. New episodes daily. All content AI-assisted; factual claims sourced from cited papers.
