
The paper "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" introduces a simple and efficient technique for training very large transformer models that exceed the memory limits of a single GPU. The authors implement an intra-layer model parallel approach that partitions the model's parameters (specifically within the self-attention and MLP blocks) across multiple GPUs. The method requires no custom compilers or library changes, relying instead on a few communication operations inserted into native PyTorch code.
Key achievements and findings include:
• Scalability: The team successfully trained transformer models with up to 8.3 billion parameters using 512 GPUs, sustaining 15.1 PetaFLOPs with 76% scaling efficiency compared to a strong single-GPU baseline.
• State-of-the-Art Results: The resulting models achieved state-of-the-art (SOTA) performance on several language tasks. The 8.3 billion parameter GPT-2 model set new records for perplexity on WikiText-103 and accuracy on the LAMBADA dataset, while their 3.9 billion parameter BERT model achieved SOTA results on the RACE dataset.
• BERT Architecture Optimization: The authors discovered that rearranging the placement of layer normalization and residual connections is critical to preventing model degradation when scaling BERT-like models to larger sizes.
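The core idea of the intra-layer split can be illustrated with a small simulation. In the transformer MLP block, Y = GeLU(XA)B, the first weight matrix A is partitioned column-wise across GPUs and the second matrix B row-wise, so the GeLU nonlinearity applies independently on each shard and only a single all-reduce (a sum) is needed to combine the partial outputs. The sketch below simulates this in numpy; the shapes, the two-way split, and the GeLU approximation are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU, applied elementwise
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
batch, d_model, d_ff, world = 4, 8, 32, 2  # world = number of simulated GPUs

X = rng.normal(size=(batch, d_model))
A = rng.normal(size=(d_model, d_ff))   # first GEMM weight
B = rng.normal(size=(d_ff, d_model))   # second GEMM weight

# Serial (single-GPU) reference: Y = GeLU(X A) B
Y_ref = gelu(X @ A) @ B

# Tensor-parallel: split A by columns, B by rows, one shard per "GPU"
A_shards = np.split(A, world, axis=1)
B_shards = np.split(B, world, axis=0)

# Each "GPU" computes its partial output with no communication,
# because GeLU is elementwise and each shard sees whole columns of X A...
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# ...then a single all-reduce (here just a sum) recovers the full result.
Y = sum(partials)

assert np.allclose(Y, Y_ref)
```

The same pattern applies to self-attention: attention heads are distributed across GPUs (each head's Q, K, V projections live on one device), and the output projection is row-parallel, again requiring only one all-reduce per block in the forward pass.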
By Yun Wu