AI Post Transformers

QVCache for Semantic Caching in ANN Search

This episode explores QVCache, a query-aware semantic cache designed to sit in front of any approximate nearest neighbor (ANN) backend and speed up vector search without significantly hurting recall. It explains why exact-match caching fails for embeddings, introduces the idea of temporal-semantic locality—where nearby-in-time queries are also nearby in embedding space—and argues that this pattern can let systems reuse recent ANN results instead of repeatedly paying the full latency and I/O cost of high-recall search. The discussion also grounds the paper in the broader vector retrieval landscape, covering recall@k, HNSW, Product Quantization, DiskANN, FAISS, and the role of vector databases in RAG and large-scale serving. Listeners would find it interesting for its practical systems focus: rather than proposing yet another index, the paper asks whether a backend-agnostic cache can deliver real speedups for production retrieval workloads.
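To make the core idea concrete: a semantic cache answers a query from recently cached ANN results whenever the new query embedding is close enough (e.g. in cosine similarity) to a cached one. The sketch below is a toy illustration of that pattern, not QVCache's actual algorithm; the threshold, FIFO eviction, and class names are all illustrative assumptions.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so a dot product is cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class SemanticCache:
    """Toy semantic cache: reuse a recent query's ANN results when a new
    query embedding is within a cosine-similarity threshold of a cached one.
    Illustrative only -- QVCache's real policy is more sophisticated."""

    def __init__(self, threshold=0.95, capacity=128):
        self.threshold = threshold
        self.capacity = capacity
        self.entries = []  # list of (unit-norm embedding, cached result list)

    def lookup(self, query):
        q = _normalize(query)
        for emb, results in self.entries:
            # Cosine similarity of unit vectors is just their dot product.
            if sum(a * b for a, b in zip(q, emb)) >= self.threshold:
                return results  # cache hit: skip the ANN backend entirely
        return None  # cache miss: caller runs the full high-recall ANN search

    def insert(self, query, results):
        self.entries.append((_normalize(query), results))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # FIFO eviction; real systems use smarter policies
```

Under temporal-semantic locality, consecutive queries tend to land within the threshold of a recent entry, so a small cache like this can absorb a meaningful fraction of traffic before it ever reaches the index.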
Sources:
1. QVCache: A Query-Aware Vector Cache — Anıl Eren Göçer, Ioanna Tsakalidou, Hamish Nicholson, Kyoungmin Kim, Anastasia Ailamaki, 2026
http://arxiv.org/abs/2602.02057
2. A Survey on Nearest Neighbor Search Methods — Mohammad A. N. Arefin, et al. (survey literature varies by edition; commonly cited broad surveys include authors such as Li, Amsaleg, Houle, and others in the NNS literature), 2018
https://scholar.google.com/scholar?q=A+Survey+on+Nearest+Neighbor+Search+Methods
3. Product Quantization for Nearest Neighbor Search — Hervé Jégou, Matthijs Douze, Cordelia Schmid, 2011
https://scholar.google.com/scholar?q=Product+Quantization+for+Nearest+Neighbor+Search
4. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs — Yu. A. Malkov, D. A. Yashunin, 2018
https://scholar.google.com/scholar?q=Efficient+and+Robust+Approximate+Nearest+Neighbor+Search+Using+Hierarchical+Navigable+Small+World+Graphs
5. DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node — Suhas Jayaram Subramanya, Devvrit, Rohan Kadekodi, Ravishankar Krishnaswamy, Harsha Vardhan Simhadri, 2019
https://scholar.google.com/scholar?q=DiskANN:+Fast+Accurate+Billion-Point+Nearest+Neighbor+Search+on+a+Single+Node
6. The FAISS Library — Jeff Johnson, Matthijs Douze, Hervé Jégou, 2021
https://scholar.google.com/scholar?q=The+FAISS+Library
7. Vespa: Serving Large-Scale Machine-Learned Relevance — Jon Bratseth and colleagues, 2023
https://scholar.google.com/scholar?q=Vespa:+Serving+Large-Scale+Machine-Learned+Relevance
8. pgvector: Open-Source Vector Similarity Search for Postgres — Andrew Kane, 2023
https://scholar.google.com/scholar?q=pgvector:+Open-Source+Vector+Similarity+Search+for+Postgres
9. Milvus: A Purpose-Built Vector Data Management System — Milvus/Zilliz engineering team and collaborators, 2021
https://scholar.google.com/scholar?q=Milvus:+A+Purpose-Built+Vector+Data+Management+System
10. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting — Yoav Freund, Robert E. Schapire, 1997
https://scholar.google.com/scholar?q=A+Decision-Theoretic+Generalization+of+On-Line+Learning+and+an+Application+to+Boosting
11. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization — John Duchi, Elad Hazan, Yoram Singer, 2011
https://scholar.google.com/scholar?q=Adaptive+Subgradient+Methods+for+Online+Learning+and+Stochastic+Optimization
12. Ad Click Prediction: a View from the Trenches — H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, et al., 2013
https://scholar.google.com/scholar?q=Ad+Click+Prediction:+a+View+from+the+Trenches
13. Bandit Algorithms for Website Optimization — John Myles White, 2012
https://scholar.google.com/scholar?q=Bandit+Algorithms+for+Website+Optimization
14. The Case for Learned Index Structures — Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis, 2018
https://scholar.google.com/scholar?q=The+Case+for+Learned+Index+Structures
15. Semantic Caching and Query Processing — Qiong Luo, Jeffrey F. Naughton, Rajasekar Krishnamurthy, Pei Cao, and Yunrui Li, 2003
https://scholar.google.com/scholar?q=Semantic+Caching+and+Query+Processing
16. GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings — Zilliz / GPTCache contributors, 2023
https://scholar.google.com/scholar?q=GPTCache:+An+Open-Source+Semantic+Cache+for+LLM+Applications+Enabling+Faster+Answers+and+Cost+Savings
17. Adaptive Similarity Search Caching (or related similarity-caching theory, cited as [12] and [38] in the paper) — authors and year not recoverable from the excerpt
https://scholar.google.com/scholar?q=Adaptive+Similarity+Search+Caching+(or+related+similarity-caching+theory+cited+as+[12]+and+[38])
18. SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search — Qi Chen, Bingbing Wang, et al., 2021
https://scholar.google.com/scholar?q=SPANN:+Highly-efficient+Billion-scale+Approximate+Nearest+Neighbor+Search
19. Optimizing SSD-Resident Graph Indexing for High-Throughput Vector Search — approx. VeloANN authors, exact author list not recoverable from snippet, recent (likely 2024-2025)
https://scholar.google.com/scholar?q=Optimizing+SSD-Resident+Graph+Indexing+for+High-Throughput+Vector+Search
20. Quake: Adaptive indexing for vector search — approx. Quake authors, exact author list not recoverable from snippet, recent (likely 2024-2025)
https://scholar.google.com/scholar?q=Quake:+Adaptive+indexing+for+vector+search
21. Vector Search for the Future: From Memory-Resident, Static Heterogeneous Storage, to Cloud-Native Architectures — approx. survey/tutorial authors, exact author list not recoverable from snippet, recent
https://scholar.google.com/scholar?q=Vector+Search+for+the+Future:+From+Memory-Resident,+Static+Heterogeneous+Storage,+to+Cloud-Native+Architectures
22. GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching — approx. GPT semantic cache authors, exact author list not recoverable from snippet, recent
https://scholar.google.com/scholar?q=GPT+Semantic+Cache:+Reducing+LLM+Costs+and+Latency+via+Semantic+Embedding+Caching
23. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
24. AI Post Transformers: Sentence-BERT: Siamese Networks for Sentence Embeddings — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/sentence-bert-siamese-networks-for-sentence-embeddings/
25. AI Post Transformers: MEMRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/memrl-self-evolving-agents-via-runtime-reinforcement-learning-on-episodic/
26. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
Interactive Visualization: QVCache for Semantic Caching in ANN Search

AI Post Transformers, by mcgrof