ChunkKV improves LLM inference efficiency by compressing the KV cache at the level of semantic chunks rather than isolated tokens, preserving linguistic integrity. It also introduces layer-wise index reuse, which boosts throughput by up to 26.5%. A separate technique, Expected Attention, estimates how important cached tokens will be to future queries.

Source: "ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference" (February 2025)
Authors: Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; HKUST Fok Ying Tung Research Institute; Terminus Technologies
https://arxiv.org/pdf/2502.00299.pdf
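To make the chunk-based idea concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes a fixed chunk size, scores chunks by mean attention from a small "observation window" of recent queries, and keeps the top-scoring chunks; the function name `chunk_kv_compress` and all parameter values are illustrative choices, not from the paper.

```python
import torch

def chunk_kv_compress(keys, values, queries,
                      chunk_size=10, keep_ratio=0.3, window=16):
    """Keep the highest-scoring contiguous chunks of a single-head KV cache.

    keys, values, queries: [seq_len, head_dim]
    Returns compressed (keys, values) plus the kept token indices, which a
    ChunkKV-style layer-wise index reuse scheme would share across layers.
    """
    seq_len, head_dim = keys.shape

    # Score tokens by the attention the most recent `window` queries pay them
    # (an illustrative stand-in for the paper's chunk-importance signal).
    obs_q = queries[-window:]
    attn = torch.softmax(obs_q @ keys.T / head_dim ** 0.5, dim=-1)
    token_score = attn.mean(dim=0)                     # [seq_len]

    # Aggregate token scores into contiguous chunks (semantic units),
    # padding the tail so the sequence divides evenly into chunks.
    n_chunks = (seq_len + chunk_size - 1) // chunk_size
    pad = n_chunks * chunk_size - seq_len
    padded = torch.cat([token_score, token_score.new_zeros(pad)])
    chunk_score = padded.view(n_chunks, chunk_size).mean(dim=1)

    # Keep whole top-scoring chunks, then expand chunk ids back to
    # token indices in original order.
    n_keep = max(1, int(n_chunks * keep_ratio))
    top_chunks = torch.topk(chunk_score, n_keep).indices
    idx = (top_chunks[:, None] * chunk_size + torch.arange(chunk_size)).flatten()
    idx = idx[idx < seq_len].sort().values

    return keys[idx], values[idx], idx

# Layer-wise index reuse (sketch): score once at an anchor layer, then
# apply the same kept indices at deeper layers, skipping per-layer scoring.
num_layers, seq_len, head_dim = 4, 128, 64
q = [torch.randn(seq_len, head_dim) for _ in range(num_layers)]
k = [torch.randn(seq_len, head_dim) for _ in range(num_layers)]
v = [torch.randn(seq_len, head_dim) for _ in range(num_layers)]

k0, v0, idx = chunk_kv_compress(k[0], v[0], q[0])
compressed = [(k0, v0)] + [(k[l][idx], v[l][idx]) for l in range(1, num_layers)]
```

Selecting whole chunks rather than scattered individual tokens is what preserves local linguistic structure, and reusing one layer's indices at subsequent layers is what removes repeated scoring work and yields the throughput gain the paper reports.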