Learning GenAI via SOTA Papers

EP011: ZeRO Solved the Trillion Parameter Memory Wall



Here is a short summary of the paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models":

Core Problem and Solution

The paper addresses the fundamental memory challenges inherent in training large deep learning models, which often exceed the memory capacity of individual devices. The authors introduce ZeRO (Zero Redundancy Optimizer), a novel solution designed to eliminate memory redundancies in data-parallel training while maintaining high computational and communication efficiency. This approach allows model size to scale efficiently in proportion to the number of devices used.

Key Innovations

ZeRO comprises two primary sets of optimizations:

ZeRO-DP (Data Parallelism): This replaces the standard replication of model states in data parallelism with partitioning. It has three cumulative optimization stages, partitioning optimizer states, then gradients, then parameters across data-parallel processes. With all three stages enabled, per-device memory consumption falls linearly with the data-parallel degree, in principle allowing arbitrarily large models.
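The paper's memory analysis can be sketched as a small calculator. This is a minimal sketch of the published formulas, assuming mixed-precision Adam training as in the paper (2 bytes/parameter for fp16 weights, 2 for fp16 gradients, and K = 12 bytes/parameter of fp32 optimizer state):

```python
def zero_dp_memory_gb(num_params, dp_degree, stage):
    """Per-device model-state memory (in 10^9 bytes) under ZeRO-DP.

    Assumes mixed-precision Adam: 2 bytes/param fp16 weights,
    2 bytes/param fp16 gradients, K = 12 bytes/param optimizer
    state (fp32 weights, momentum, variance), as in the paper.
    """
    psi, nd, K = num_params, dp_degree, 12
    if stage == 0:      # plain data parallelism: full replication
        per_device = (2 + 2 + K) * psi
    elif stage == 1:    # P_os: partition optimizer states
        per_device = (2 + 2) * psi + K * psi / nd
    elif stage == 2:    # P_os+g: also partition gradients
        per_device = 2 * psi + (2 + K) * psi / nd
    elif stage == 3:    # P_os+g+p: also partition parameters
        per_device = (2 + 2 + K) * psi / nd
    else:
        raise ValueError("stage must be 0-3")
    return per_device / 1e9

# The paper's running example: a 7.5B-parameter model on 64 GPUs.
print(f"baseline: {zero_dp_memory_gb(7.5e9, 64, 0):.1f} GB")  # 120.0 GB
print(f"stage 1:  {zero_dp_memory_gb(7.5e9, 64, 1):.1f} GB")  # 31.4 GB
print(f"stage 3:  {zero_dp_memory_gb(7.5e9, 64, 3):.1f} GB")  # 1.9 GB
```

These match the paper's table: the 16-bytes-per-parameter baseline shrinks to 16/Nd at stage 3, which is where the "linear with the degree of data parallelism" claim comes from.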

ZeRO-R (Residual Memory): This optimizes the remaining memory consumed by activations, temporary buffers, and fragmentation. It includes innovations like partitioned activation checkpointing, which removes activation replication in model parallelism.

Results and Impact

The implementation, termed ZeRO-100B, demonstrated significant breakthroughs:

Scale: It successfully trained models with up to 170 billion parameters, an 8x increase over the previous state of the art.

Speed: It achieved a 10x improvement in training speed, sustaining an aggregate performance of over 15 petaflops on a 400-GPU cluster.

Democratization: It allows data scientists to train large models (up to 13 billion parameters) without complex model parallelism, effectively democratizing large model training.
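For context, this democratization survives in today's DeepSpeed library, where ZeRO is switched on through a JSON configuration rather than code changes. A minimal illustrative fragment (values are examples, not a recommendation) might look like:

```json
{
  "train_batch_size": 16,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 }
}
```

Stage 1 here corresponds to the optimizer-state partitioning that, combined with ZeRO-R, made up the ZeRO-100B system described in the paper.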

Real-World Application: The system was used to create Turing-NLG, a 17-billion parameter language model that set new accuracy records.

The authors conclude that ZeRO has the potential to scale to models with 1 trillion parameters on current hardware. The technology is released as part of Microsoft's open-source library, DeepSpeed.


Learning GenAI via SOTA Papers, by Yun Wu