AI Post Transformers

MEDUSA: Parallel Decoding Heads for Accelerated LLM Inference

MEDUSA is a framework, introduced in 2024, designed to accelerate Large Language Model (LLM) inference by overcoming the delays caused by sequential token generation. Instead of relying on a separate draft model as in traditional speculative decoding, it adds multiple decoding heads that predict several subsequent tokens simultaneously. These predictions are organized into a tree of candidate continuations, and a tree-based attention mechanism lets the model verify many of them in a single parallel step. The framework offers two fine-tuning tiers: MEDUSA-1, which keeps the backbone model frozen for easy integration, and MEDUSA-2, which trains the heads and backbone together for greater speed. Additionally, a typical acceptance scheme and a self-distillation pipeline preserve high-quality, diverse outputs even when the original training data is unavailable. Experimental results show that this approach increases generation speed by 2.2 to 2.8 times without compromising the accuracy or quality of the language model.

Source: June 14, 2024
MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Princeton University, Together AI, University of Illinois Urbana-Champaign, Carnegie Mellon University, University of Connecticut
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
https://arxiv.org/pdf/2401.10774
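To make the "multiple decoding heads" idea concrete, here is a minimal PyTorch sketch of extra heads sitting on top of a backbone's final hidden state. It assumes a residual-block-plus-linear head structure in the spirit of the paper; the class names, layer sizes, and the SiLU activation are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block used inside each extra decoding head (assumed structure)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))


class MedusaHeads(nn.Module):
    """K extra decoding heads on top of the backbone's last hidden state.

    Head k predicts the token at position t + k + 1 from the hidden state at
    position t, so a single backbone forward pass yields K speculative tokens
    in addition to the ordinary next-token prediction.
    """
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(ResBlock(hidden_size),
                           nn.Linear(hidden_size, vocab_size, bias=False))
             for _ in range(num_heads)]
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: [batch, hidden_size] for the final position.
        # Returns logits of shape [num_heads, batch, vocab_size].
        return torch.stack([head(last_hidden) for head in self.heads])


# Usage sketch (illustrative sizes): take the top-1 token from each head as a
# speculated continuation; in the full method, several candidates per head are
# arranged in a tree and verified by the backbone in one parallel step.
hidden_size, vocab_size = 4096, 32000
heads = MedusaHeads(hidden_size, vocab_size, num_heads=4)
h_t = torch.randn(1, hidden_size)        # placeholder for the backbone's last hidden state
proposals = heads(h_t).argmax(dim=-1)    # shape [4, 1]: four speculated next tokens
```

In the MEDUSA-1 setting only these head parameters would be trained while the backbone stays frozen; MEDUSA-2 fine-tunes both together.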

AI Post Transformers, by mcgrof