Mechanical Dreams

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations



In this episode:
• Introduction to SonicMoE: Professor Norris and Linda introduce the episode's topic, the SonicMoE paper, and discuss the recent trends toward fine-grained and highly sparse Mixture of Experts models.
• The Hardware Inefficiency Problem: The hosts break down why increasing MoE granularity and sparsity leads to major hardware bottlenecks, specifically focusing on IO costs, activation memory, and tile quantization effects.
• Minimizing Activation Memory: Linda explains SonicMoE's clever algorithmic redesign of the backward pass computation graph, which reduces activation memory by 45 percent without adding any extra FLOPs.
• Overlapping Compute and IO: The discussion shifts to GPU-level optimizations. Linda details how the authors use Ping-Pong scheduling and asynchronous TMA on Hopper GPUs to hide memory latency behind matrix math.
• Token Rounding Routing: Professor Norris questions the padding waste in Grouped GEMMs. Linda reveals the paper's novel Token Rounding method, which aligns the number of tokens routed to each expert with hardware tile sizes and so cuts the compute wasted on padding.
• Conclusion and Impact: The hosts wrap up the episode by discussing the broader implications of SonicMoE for the future of large language model training and scaling laws.
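
For listeners who want a concrete feel for the tile quantization problem mentioned above, here is a minimal Python sketch. It is not SonicMoE's implementation: the tile size of 128, the per-expert token counts, and the helper names `padded_rows` and `round_to_tile` are all hypothetical, and the nearest-multiple rounding rule is an illustrative assumption rather than the paper's routing method. It only shows why per-expert token counts that are not multiples of the tile size cost extra rows of matrix math, and how counts that are already tile multiples avoid that cost.

```python
def padded_rows(tokens_per_expert, tile_m):
    """Rows a tiled GEMM computes when each expert's count is padded up to a tile multiple."""
    return [-(-t // tile_m) * tile_m for t in tokens_per_expert]  # ceiling division


def round_to_tile(tokens_per_expert, tile_m):
    """Hypothetical rule: snap each count to the nearest tile multiple (ties round up)."""
    return [(t + tile_m // 2) // tile_m * tile_m for t in tokens_per_expert]


tile_m = 128                 # assumed tile height, for illustration only
counts = [130, 70, 250, 5]   # made-up tokens routed to four experts

padded = padded_rows(counts, tile_m)
waste = sum(padded) - sum(counts)       # rows computed purely as padding

rounded = round_to_tile(counts, tile_m)
rounded_waste = sum(padded_rows(rounded, tile_m)) - sum(rounded)

print("padded rows:", padded, "wasted rows:", waste)
print("rounded counts:", rounded, "wasted rows:", rounded_waste)
```

In this toy case the unrounded counts need 768 rows of compute for 455 real tokens, while the rounded counts need no padding at all. The trade-off, which the episode discusses, is that rounding changes which tokens each expert actually processes.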

Mechanical Dreams, by Mechanical Dirk