AI: post transformers

KVQuant: LLM Inference with KV Cache Quantization


Three research papers are reviewed:


1) https://arxiv.org/pdf/2401.18079 - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

2) https://arxiv.org/pdf/2402.02750 - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

3) https://arxiv.org/pdf/2502.04420 - KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference


These sources collectively discuss methods for quantizing the Key-Value (KV) cache in large language models (LLMs) to reduce memory consumption and improve inference efficiency, especially at long context lengths. They explore various quantization strategies, highlighting the importance of per-channel quantization for Keys and per-token quantization for Values due to their distinct data distributions. Key advancements include pre-RoPE quantization, non-uniform quantization, and dense-and-sparse techniques that preserve accuracy at low bit widths such as 2-bit and 3-bit. The papers also detail custom kernel implementations and offline calibration methods that minimize computational overhead, demonstrating significant throughput gains and larger batch sizes while preserving model performance across diverse benchmarks and LLM architectures.
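
To make the per-channel versus per-token grouping and the dense-and-sparse idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not code from the papers: the function names, tensor shapes, bit widths, 1% outlier fraction, and synthetic data are all hypothetical, and it omits the pre-RoPE and non-uniform aspects discussed above.

```python
import numpy as np

def quantize_asymmetric(x, axis, bits=2):
    # Uniform asymmetric quantization: one (min, scale) pair per slice
    # along `axis`, i.e. per-channel or per-token grouping.
    qmax = 2**bits - 1
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = np.where(xmax > xmin, (xmax - xmin) / qmax, 1.0)
    q = np.clip(np.round((x - xmin) / scale), 0, qmax)
    return q * scale + xmin  # dequantized approximation

def dense_and_sparse(x, axis, bits=3, outlier_frac=0.01):
    # Dense-and-sparse sketch: keep the largest-magnitude entries in full
    # precision (sparse part) and quantize the remainder (dense part).
    thresh = np.quantile(np.abs(x), 1.0 - outlier_frac)
    outliers = np.abs(x) > thresh
    dense_hat = quantize_asymmetric(np.where(outliers, 0.0, x), axis, bits)
    return np.where(outliers, x, dense_hat)

# Hypothetical KV cache slices for one head: (num_tokens, head_dim).
rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 64)) * np.linspace(0.1, 4.0, 64)  # outlier channels
values = rng.normal(size=(128, 64))

# Keys: group along the token axis (axis=0) so each channel gets its own
# scale; per-channel quantization absorbs the outlier channels.
keys_hat = quantize_asymmetric(keys, axis=0, bits=2)

# Values: group along the channel axis (axis=1), i.e. per-token quantization.
values_hat = quantize_asymmetric(values, axis=1, bits=2)

print("mean |key error|:  ", np.abs(keys - keys_hat).mean())
print("mean |value error|:", np.abs(values - values_hat).mean())
```

The grouping choice is the whole point of the sketch: Keys tend to have a few large-magnitude channels, so giving each channel its own scale (and isolating the worst outliers in the sparse part) keeps 2-bit or 3-bit error small, while Values are better served by one scale per token.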

