AI Post Transformers

QVCache for Semantic Caching in ANN Search

This episode explores QVCache, a query-aware semantic cache designed to sit in front of any approximate nearest neighbor (ANN) backend and speed up vector search without significantly hurting recall. It explains why exact-match caching fails for embeddings, introduces the idea of temporal-semantic locality—where nearby-in-time queries are also nearby in embedding space—and argues that this pattern can let systems reuse recent ANN results instead of repeatedly paying the full latency and I/O cost of high-recall search. The discussion also grounds the paper in the broader vector retrieval landscape, covering recall@k, HNSW, Product Quantization, DiskANN, FAISS, and the role of vector databases in RAG and large-scale serving. Listeners would find it interesting for its practical systems focus: rather than proposing yet another index, the paper asks whether a backend-agnostic cache can deliver real speedups for production retrieval workloads.
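To make the core idea concrete: a semantic cache answers a query from recently cached ANN results whenever the new query embedding is close enough (e.g. in cosine similarity) to a cached one. The sketch below is a toy illustration of that pattern, not QVCache's actual algorithm; the threshold, FIFO eviction, and class names are all illustrative assumptions.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so a dot product is cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class SemanticCache:
    """Toy semantic cache: reuse a recent query's ANN results when a new
    query embedding is within a cosine-similarity threshold of a cached one.
    Illustrative only -- QVCache's real policy is more sophisticated."""

    def __init__(self, threshold=0.95, capacity=128):
        self.threshold = threshold
        self.capacity = capacity
        self.entries = []  # list of (unit-norm embedding, cached result list)

    def lookup(self, query):
        q = _normalize(query)
        for emb, results in self.entries:
            # Cosine similarity of unit vectors is just their dot product.
            if sum(a * b for a, b in zip(q, emb)) >= self.threshold:
                return results  # cache hit: skip the ANN backend entirely
        return None  # cache miss: caller runs the full high-recall ANN search

    def insert(self, query, results):
        self.entries.append((_normalize(query), results))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # FIFO eviction; real systems use smarter policies
```

Under temporal-semantic locality, consecutive queries tend to land within the threshold of a recent entry, so a small cache like this can absorb a meaningful fraction of traffic before it ever reaches the index.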
Sources:
1. QVCache: A Query-Aware Vector Cache — Anıl Eren Göçer, Ioanna Tsakalidou, Hamish Nicholson, Kyoungmin Kim, Anastasia Ailamaki, 2026
http://arxiv.org/abs/2602.02057
2. A Survey on Nearest Neighbor Search Methods — Mohammad A. N. Arefin, et al. (survey literature varies by edition; commonly cited broad surveys include authors such as Li, Amsaleg, Houle, and others in the NNS literature), 2018
https://scholar.google.com/scholar?q=A+Survey+on+Nearest+Neighbor+Search+Methods
3. Product Quantization for Nearest Neighbor Search — Hervé Jégou, Matthijs Douze, Cordelia Schmid, 2011
https://scholar.google.com/scholar?q=Product+Quantization+for+Nearest+Neighbor+Search
4. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs — Yu. A. Malkov, D. A. Yashunin, 2018
https://scholar.google.com/scholar?q=Efficient+and+Robust+Approximate+Nearest+Neighbor+Search+Using+Hierarchical+Navigable+Small+World+Graphs
5. DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node — Suhas Jayaram Subramanya, Devvrit, Rohan Kadekodi, Ravishankar Krishnaswamy, Harsha Vardhan Simhadri, 2019
https://scholar.google.com/scholar?q=DiskANN:+Fast+Accurate+Billion-Point+Nearest+Neighbor+Search+on+a+Single+Node
6. The FAISS Library — Jeff Johnson, Matthijs Douze, Hervé Jégou, 2021
https://scholar.google.com/scholar?q=The+FAISS+Library
7. Vespa: Serving Large-Scale Machine-Learned Relevance — Jon Bratseth and colleagues, 2023
https://scholar.google.com/scholar?q=Vespa:+Serving+Large-Scale+Machine-Learned+Relevance
8. pgvector: Open-Source Vector Similarity Search for Postgres — Andrew Kane, 2023
https://scholar.google.com/scholar?q=pgvector:+Open-Source+Vector+Similarity+Search+for+Postgres
9. Milvus: A Purpose-Built Vector Data Management System — Milvus/Zilliz engineering team and collaborators, 2021
https://scholar.google.com/scholar?q=Milvus:+A+Purpose-Built+Vector+Data+Management+System
10. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting — Yoav Freund, Robert E. Schapire, 1997
https://scholar.google.com/scholar?q=A+Decision-Theoretic+Generalization+of+On-Line+Learning+and+an+Application+to+Boosting
11. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization — John Duchi, Elad Hazan, Yoram Singer, 2011
https://scholar.google.com/scholar?q=Adaptive+Subgradient+Methods+for+Online+Learning+and+Stochastic+Optimization
12. Ad Click Prediction: a View from the Trenches — H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, et al., 2013
https://scholar.google.com/scholar?q=Ad+Click+Prediction:+a+View+from+the+Trenches
13. Bandit Algorithms for Website Optimization — John Myles White, 2012
https://scholar.google.com/scholar?q=Bandit+Algorithms+for+Website+Optimization
14. The Case for Learned Index Structures — Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis, 2018
https://scholar.google.com/scholar?q=The+Case+for+Learned+Index+Structures
15. Semantic Caching and Query Processing — Qiong Luo, Jeffrey F. Naughton, Rajasekar Krishnamurthy, Pei Cao, and Yunrui Li, 2003
https://scholar.google.com/scholar?q=Semantic+Caching+and+Query+Processing
16. GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings — Zilliz / GPTCache contributors, 2023
https://scholar.google.com/scholar?q=GPTCache:+An+Open-Source+Semantic+Cache+for+LLM+Applications+Enabling+Faster+Answers+and+Cost+Savings
17. Adaptive Similarity Search Caching (or related similarity-caching theory, cited as [12] and [38] in the paper) — authors and year not recoverable from the excerpt
https://scholar.google.com/scholar?q=Adaptive+Similarity+Search+Caching+(or+related+similarity-caching+theory+cited+as+[12]+and+[38])
18. SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search — Qi Chen, Bingbing Wang, et al., 2021
https://scholar.google.com/scholar?q=SPANN:+Highly-efficient+Billion-scale+Approximate+Nearest+Neighbor+Search
19. Optimizing SSD-Resident Graph Indexing for High-Throughput Vector Search — approx. VeloANN authors, exact author list not recoverable from snippet, recent (likely 2024-2025)
https://scholar.google.com/scholar?q=Optimizing+SSD-Resident+Graph+Indexing+for+High-Throughput+Vector+Search
20. Quake: Adaptive indexing for vector search — approx. Quake authors, exact author list not recoverable from snippet, recent (likely 2024-2025)
https://scholar.google.com/scholar?q=Quake:+Adaptive+indexing+for+vector+search
21. Vector Search for the Future: From Memory-Resident, Static Heterogeneous Storage, to Cloud-Native Architectures — approx. survey/tutorial authors, exact author list not recoverable from snippet, recent
https://scholar.google.com/scholar?q=Vector+Search+for+the+Future:+From+Memory-Resident,+Static+Heterogeneous+Storage,+to+Cloud-Native+Architectures
22. GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching — approx. GPT semantic cache authors, exact author list not recoverable from snippet, recent
https://scholar.google.com/scholar?q=GPT+Semantic+Cache:+Reducing+LLM+Costs+and+Latency+via+Semantic+Embedding+Caching
23. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
24. AI Post Transformers: Sentence-BERT: Siamese Networks for Sentence Embeddings — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/sentence-bert-siamese-networks-for-sentence-embeddings/
25. AI Post Transformers: MEMRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/memrl-self-evolving-agents-via-runtime-reinforcement-learning-on-episodic/
26. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
Interactive Visualization: QVCache for Semantic Caching in ANN Search

AI Post Transformers, by mcgrof