The provided sources explore advanced techniques for optimizing large language model (LLM) inference by addressing the memory bottleneck of the Key-Value (KV) cache. KVQuant introduces a low-bit quantization framework that combines per-channel key scaling, non-uniform datatypes, and sparse outlier handling to compress cached activations to sub-4-bit precision with minimal accuracy loss. Similarly, the KIVI algorithm proposes a tuning-free, asymmetric 2-bit quantization strategy that quantizes the key cache per-channel and the value cache per-token, matching their distinct distributions, to increase decoding throughput (a minimal sketch of this asymmetric scheme follows the source list below). Shifting from quantization to head-level cache pruning, DuoAttention identifies retrieval heads that require the full context while reducing streaming heads to constant memory usage by attending only to recent tokens and attention sinks. Together, these methods enable LLMs to process million-token context lengths on standard hardware by drastically reducing the memory and computational footprint of stored activations.

Sources:

1) KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (2024). Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. University of California, Berkeley; ICSI; LBNL. https://arxiv.org/pdf/2401.18079

2) KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (2024). Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu. Rice University; Texas A&M University; Stevens Institute of Technology; Carnegie Mellon University. https://arxiv.org/pdf/2402.02750

3) QAQ: Quality Adaptive Quantization for LLM KV Cache (2024). Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang. Nanjing University. https://arxiv.org/pdf/2403.04643

4) KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization (May 8, 2024). Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava. Rice University; Stevens Institute of Technology; ThirdAI Corp. https://arxiv.org/pdf/2405.03917

5) DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (2024). Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han. MIT; Tsinghua University; SJTU; University of Edinburgh; NVIDIA. https://arxiv.org/pdf/2410.10819

6) MILLION: Mastering Long-Context LLM Inference via Outlier-Immunized KV Product Quantization (2025). Zongwu Wang, Peng Xu, Fangxin Liu, Yiwei Hu, Qingxiao Sun, Gezi Li, Cheng Li, Xuan Wang, Li Jiang, Haibing Guan. Shanghai Jiao Tong University; Shanghai Qi Zhi Institute; Huawei Technologies Co., Ltd.; China University of Petroleum-Beijing. https://arxiv.org/pdf/2504.03661
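
The asymmetric idea behind KIVI can be illustrated with a short NumPy sketch: keys get one scale and zero-point per channel, values get one per token. This is only a toy illustration under assumed shapes; the helper names and group choices are made up here and do not come from any of the papers' code, and the real methods add grouping, bit-packing, and a small full-precision window of recent tokens.

    # Toy sketch of asymmetric low-bit KV quantization in the spirit of KIVI:
    # keys quantized per-channel, values per-token. Names, shapes, and bit
    # widths are illustrative assumptions, not the papers' implementations.
    import numpy as np

    def quantize_axis(x: np.ndarray, bits: int, axis: int):
        """Uniform asymmetric quantization with one (scale, zero-point) per
        slice along `axis`. Returns integer codes plus dequantization params."""
        x_min = x.min(axis=axis, keepdims=True)
        x_max = x.max(axis=axis, keepdims=True)
        scale = (x_max - x_min) / (2 ** bits - 1)
        scale = np.where(scale == 0, 1.0, scale)  # guard against flat slices
        codes = np.clip(np.round((x - x_min) / scale), 0, 2 ** bits - 1)
        return codes.astype(np.uint8), scale, x_min

    def dequantize(codes, scale, zero):
        return codes.astype(np.float32) * scale + zero

    # Toy KV cache for one head: (num_tokens, head_dim)
    rng = np.random.default_rng(0)
    K = rng.normal(size=(128, 64)).astype(np.float32)
    V = rng.normal(size=(128, 64)).astype(np.float32)

    # Keys: per-channel statistics (reduce over tokens, axis=0), since key
    # outliers tend to concentrate in a few channels.
    K_codes, K_scale, K_zero = quantize_axis(K, bits=2, axis=0)
    # Values: per-token statistics (reduce over channels, axis=1).
    V_codes, V_scale, V_zero = quantize_axis(V, bits=2, axis=1)

    K_hat = dequantize(K_codes, K_scale, K_zero)
    V_hat = dequantize(V_codes, V_scale, V_zero)
    print("key MSE:", float(np.mean((K - K_hat) ** 2)))
    print("value MSE:", float(np.mean((V - V_hat) ** 2)))

The choice of reduction axis is the whole point of the asymmetry: sharing quantization statistics across tokens isolates the outlier-heavy key channels, while sharing across channels fits the more uniform per-token value distributions.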