
The source presents a technical paper addressing the significant memory bandwidth overhead that slows down autoregressive decoder inference in large Transformer models. The work offers two core solutions. First, a method called uptraining allows existing high-quality multi-head attention (MHA) checkpoints to be converted into faster models using only a small percentage of their original training compute. Second, the authors introduce grouped-query attention (GQA), a generalization of multi-query attention (MQA) that serves as a quality-preserving intermediate step between MHA and the faster but less stable MQA. GQA operates by dividing the query heads into small groups, each sharing a single key and value head derived by mean-pooling the original heads. Experimental results confirm that the uptrained GQA models achieve quality comparable to MHA while delivering inference speeds nearly as fast as MQA, successfully balancing quality and computational efficiency.
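To make the grouping and mean-pooling idea concrete, here is a minimal PyTorch sketch, not the paper's code: it assumes key/value projection weights are laid out as consecutive per-head blocks, that consecutive query heads form a group, and the helper names `pool_kv_heads` and `grouped_query_attention` are hypothetical.

```python
import torch

def pool_kv_heads(kv_proj: torch.Tensor, num_heads: int, num_groups: int) -> torch.Tensor:
    """Mean-pool per-head key (or value) projection weights into one head per group.

    kv_proj: weight of shape (num_heads * head_dim, d_model), one block per head.
    Returns a weight of shape (num_groups * head_dim, d_model).
    """
    head_dim = kv_proj.shape[0] // num_heads
    d_model = kv_proj.shape[1]
    # Group consecutive heads (an assumption) and average the heads within each
    # group, mirroring the mean-pooling step used when converting an MHA
    # checkpoint into GQA form before uptraining.
    per_head = kv_proj.view(num_groups, num_heads // num_groups, head_dim, d_model)
    return per_head.mean(dim=1).reshape(num_groups * head_dim, d_model)

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                            num_groups: int) -> torch.Tensor:
    """Attention where each group of query heads shares a single key/value head.

    q: (batch, num_q_heads, seq, head_dim); k, v: (batch, num_groups, seq, head_dim).
    """
    batch, num_q_heads, seq, head_dim = q.shape
    # Broadcast each shared K/V head across the query heads in its group.
    repeat = num_q_heads // num_groups
    k = k.repeat_interleave(repeat, dim=1)
    v = v.repeat_interleave(repeat, dim=1)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```

With one group the sketch reduces to MQA (one shared key/value head), and with as many groups as query heads it reduces to standard MHA, which is the interpolation the paper exploits.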