The Gist Talk

Cross-Layer Attention for KV Cache Optimization



The research introduces Cross-Layer Attention (CLA), an architectural modification designed to reduce the substantial memory overhead of the Key-Value (KV) cache during the decoding phase of large language models (LLMs). Unlike established methods such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which shrink the cache by sharing key/value heads within a layer, CLA saves memory by sharing key and value activations across adjacent layers. Experiments on 1B- and 3B-parameter models show that combining CLA with MQA yields a further 2× reduction in KV cache size with minimal degradation in perplexity. The authors argue that this technique advances the accuracy/memory Pareto frontier relative to existing transformer designs. By making LLM serving more memory-efficient, CLA enables practitioners to support longer sequence lengths and larger batch sizes.
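To make the idea concrete, here is a minimal sketch of CLA combined with MQA in PyTorch. This is not the authors' code: the class names, dimensions, and the every-other-layer sharing pattern (the paper's "CLA2" configuration) are illustrative assumptions based on the description above.

```python
# Minimal sketch of Cross-Layer Attention (CLA) + Multi-Query Attention (MQA).
# Illustrative only: names, sizes, and the 2-layer sharing pattern are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MQABlock(nn.Module):
    """Multi-Query Attention: many query heads, one shared K/V head.

    If owns_kv is False, the block has no K/V projection and instead reuses
    the K/V activations produced by an earlier layer (Cross-Layer Attention),
    so it contributes nothing to the KV cache.
    """

    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Only KV-producing layers project keys/values (a single head each, as in MQA).
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim) if owns_kv else None

    def forward(self, x, shared_kv=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        if self.kv_proj is not None:
            k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
            # One K/V head broadcast across all query heads (MQA).
            k = k.unsqueeze(1).expand(-1, self.n_heads, -1, -1)
            v = v.unsqueeze(1).expand(-1, self.n_heads, -1, -1)
        else:
            # Cross-Layer Attention: reuse the previous layer's K/V activations.
            k, v = shared_kv
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), (k, v)


# CLA2 pattern: every second layer produces K/V; the next layer reuses it.
d_model, n_heads, n_layers = 512, 8, 4
layers = nn.ModuleList(
    [MQABlock(d_model, n_heads, owns_kv=(i % 2 == 0)) for i in range(n_layers)]
)

x = torch.randn(2, 16, d_model)
kv = None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)
```

Under this sharing pattern only half of the layers write keys and values to the cache, which is where the 2× KV-cache reduction on top of MQA comes from.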


The Gist Talk, by kw