How can pre-computing and reusing Key-Value (KV) caches accelerate inference for Retrieval-Augmented Generation and other long-context LLM tasks?
The provided sources all identify the same core problem: high latency in Large Language Model (LLM) inference caused by repeatedly processing long, recurring contexts during prefill. They converge on a unified solution: avoid this redundant computation by pre-computing, storing, and reusing the Key-Value (KV) caches of recurring text segments (referred to as chunks, documents, or prompt modules). Each source then contributes a distinct perspective on *how* to implement this reuse effectively, addressing the specific challenges that arise from it.
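As a concrete illustration of the basic idea, the sketch below pre-computes the KV cache of a shared document once and reuses it to answer multiple questions, assuming a HuggingFace-style causal LM. The model name `gpt2`, the `answer` helper, and greedy decoding are placeholders rather than details from the sources, and the sketch covers only the simplest case in which the reused chunk is an exact prefix of every prompt; reusing caches for non-prefix or multiple independent chunks requires the additional techniques the sources describe.

```python
# Minimal sketch of prefix KV-cache reuse, assuming a HuggingFace-style causal LM.
# Placeholders (not from the sources): model "gpt2", the answer() helper, greedy decoding.
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

document = "Retrieved document text that many different queries will share. "
doc_ids = tok(document, return_tensors="pt").input_ids

# 1. Pre-compute (prefill) the KV cache for the shared document once and store it.
with torch.no_grad():
    doc_kv = model(doc_ids, use_cache=True).past_key_values

def answer(question: str, max_new_tokens: int = 20) -> str:
    """Answer a question about the cached document without re-prefilling it."""
    # 2. Reuse a copy of the stored cache so the pre-computed entries stay intact.
    past = copy.deepcopy(doc_kv)
    next_input = tok(question, return_tensors="pt").input_ids
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Only the new tokens are fed; attention over the document comes
            # from the cached keys/values, so its prefill cost is skipped.
            out = model(next_input, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_tok)
            next_input = next_tok
    return tok.decode(torch.cat(generated, dim=1)[0], skip_special_tokens=True)

print(answer("Question: what does the document describe?"))
```

Copying the stored cache before each request keeps the pre-computed entries immutable, so a single prefill of the document can serve many queries; real systems additionally handle cache storage, transfer, and eviction, which this sketch omits.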
Sources:
https://arxiv.org/html/2502.15734v1
https://arxiv.org/html/2412.15605v1
https://arxiv.org/html/2502.16002v1
https://arxiv.org/html/2310.07240v6
https://arxiv.org/pdf/2404.12457
https://openreview.net/pdf?id=x7NbaU8RSU
https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Paper-Conference.pdf
https://www.cs.princeton.edu/~ravian/COS597_F24/papers/cacheblend.pdf