Byte Sized Breakthroughs

Zero Bubble Pipeline Parallelism


The core idea is to split the backward pass into two flows: one computes gradients with respect to a layer's parameters, and one computes gradients with respect to the layer's input (the part that must be passed back to the previous pipeline stage). Scheduling these two flows independently keeps every device working instead of waiting in a pipeline bubble.
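This split can be sketched for a single linear layer. This is an illustrative example, not the paper's implementation: function names and shapes here are my own, chosen to show why the two gradient computations are independent.

```python
# Sketch (hypothetical helper names): for a linear layer y = W @ x, the
# backward pass naturally splits into two independent computations.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))     # layer weights
x = rng.standard_normal(4)          # layer input (saved activation)
g = rng.standard_normal(3)          # upstream gradient dL/dy

def backward_input(W, g):
    # "Input" flow: dL/dx = W^T g. The previous pipeline stage is
    # blocked on this result, so it should be scheduled early.
    return W.T @ g

def backward_weight(x, g):
    # "Weight" flow: dL/dW = g x^T. No other stage depends on it, so it
    # can be deferred to fill what would otherwise be a bubble.
    return np.outer(g, x)

grad_x = backward_input(W, g)   # urgent: unblocks the upstream stage
grad_W = backward_weight(x, g)  # deferrable: only needed at the optimizer step
```

Because `backward_weight` only needs the saved activation and the upstream gradient, a scheduler is free to slot it into otherwise idle time on each device, which is what eliminates the bubble.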
Read full paper: https://arxiv.org/abs/2401.10241
Tags: Systems and Performance, Deep Learning, Machine Learning

By Arjun Srivastava