AI Post Transformers

Cache Mechanism for Agent RAG Systems



This episode explores a 2025 paper on cache management for agentic RAG systems, asking whether an annotation-free cache can preserve most of the value of a massive retrieval corpus while using far less storage and reducing latency. It explains how RAG, agent memory, vector databases, embeddings, and approximate nearest neighbor search fit together, arguing that retrieval performance is not just a modeling issue but a core systems constraint for real-world agents. The discussion situates the paper in the broader history of retrieval and agent research, from Word2Vec and BERT to Dense Passage Retrieval, ReAct, and FAISS, showing why externalized knowledge remains useful even as language models grow larger. Listeners would find it interesting because it focuses on a practical but consequential question: how to make retrieval-heavy AI agents cheaper, faster, and more deployable outside large cloud infrastructures.
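The core idea discussed in the episode can be sketched in a few lines: keep a small, annotation-free cache of (query embedding, retrieved results) pairs in front of the full corpus, and serve any sufficiently similar query from the cache instead of running a full approximate nearest neighbor search. The sketch below is illustrative only; the class name `SemanticCache`, the similarity threshold, and the LRU eviction policy are assumptions for the example, not details taken from the paper.

```python
# Minimal sketch of a semantic cache in front of a retrieval corpus.
# Assumption: a "hit" is any cached entry whose query embedding has
# cosine similarity above `threshold` with the incoming query.
import math
from collections import OrderedDict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """LRU cache keyed by query embedding (hypothetical interface)."""
    def __init__(self, capacity=128, threshold=0.9):
        self.capacity = capacity
        self.threshold = threshold
        self.entries = OrderedDict()  # key -> (embedding, results)

    def lookup(self, query_emb):
        """Return cached results for the most similar stored query,
        or None if no entry clears the similarity threshold."""
        best_key, best_sim = None, self.threshold
        for key, (emb, _) in self.entries.items():
            sim = cosine(query_emb, emb)
            if sim >= best_sim:
                best_key, best_sim = key, sim
        if best_key is not None:
            self.entries.move_to_end(best_key)  # mark as recently used
            return self.entries[best_key][1]
        return None  # cache miss: caller falls back to full ANN search

    def insert(self, key, query_emb, results):
        """Store results from a full retrieval; evict the least
        recently used entry when over capacity."""
        self.entries[key] = (query_emb, results)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
```

The storage-versus-recall trade-off the episode highlights lives in two knobs here: `capacity` bounds the cache footprint, and `threshold` controls how aggressively near-duplicate queries are deduplicated before touching the full corpus.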
Sources:
1. Cache Mechanism for Agent RAG Systems — Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang, 2025
http://arxiv.org/abs/2511.02919
2. PlanRAG — Lee et al., 2024
https://scholar.google.com/scholar?q=PlanRAG
3. Generate-then-Ground — Shi et al., 2024
https://scholar.google.com/scholar?q=Generate-then-Ground
4. RAP — Kagaya et al., 2024
https://scholar.google.com/scholar?q=RAP
5. RAT — Wang et al., 2024
https://scholar.google.com/scholar?q=RAT
6. Mei et al. (system engineering / large knowledge repositories) — Mei et al., 2025
https://scholar.google.com/scholar?q=Mei+et+al.+(system+engineering+/+large+knowledge+repositories)
7. Guo et al. on RAG-powered agent architectures — Guo et al., 2025
https://scholar.google.com/scholar?q=Guo+et+al.+on+RAG-powered+agent+architectures
8. Long Context vs. RAG for LLMs: An Evaluation and Revisits — approx. recent LLM/RAG evaluation authors, 2024/2025
https://scholar.google.com/scholar?q=Long+Context+vs.+RAG+for+LLMs:+An+Evaluation+and+Revisits
9. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? — approx. recent long-context LLM systems authors, 2024/2025
https://scholar.google.com/scholar?q=Can+Long-Context+Language+Models+Subsume+Retrieval,+RAG,+SQL,+and+More?
10. Predicting Retrieval Utility and Answer Quality in Retrieval-Augmented Generation — approx. recent RAG evaluation/prediction authors, 2024/2025
https://scholar.google.com/scholar?q=Predicting+Retrieval+Utility+and+Answer+Quality+in+Retrieval-Augmented+Generation
11. Relevance Filtering for Embedding-Based Retrieval — approx. recent dense retrieval / IR authors, 2024/2025
https://scholar.google.com/scholar?q=Relevance+Filtering+for+Embedding-Based+Retrieval
12. Volatility-Driven Decay: Adaptive Memory Retention for RAG Systems Under Unknown Drift — approx. recent continual RAG / memory authors, 2025
https://scholar.google.com/scholar?q=Volatility-Driven+Decay:+Adaptive+Memory+Retention+for+RAG+Systems+Under+Unknown+Drift
13. On the Role of Long-Tail Knowledge in Retrieval Augmented Large Language Models — approx. recent RAG robustness authors, 2024/2025
https://scholar.google.com/scholar?q=On+the+Role+of+Long-Tail+Knowledge+in+Retrieval+Augmented+Large+Language+Models
14. Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge — approx. recent biomedical retrieval authors, 2024/2025
https://scholar.google.com/scholar?q=Graph-Based+Retriever+Captures+the+Long+Tail+of+Biomedical+Knowledge
15. FIT-RAG: Black-Box RAG with Factual Information and Token Reduction — approx. recent black-box RAG authors, 2024/2025
https://scholar.google.com/scholar?q=FIT-RAG:+Black-Box+RAG+with+Factual+Information+and+Token+Reduction
16. AI Post Transformers: QVCache for Semantic Caching in ANN Search — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-qvcache-for-semantic-caching-in-ann-sear-415304.mp3
17. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
18. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
19. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
20. AI Post Transformers: ColBERT and ColBERT v2 — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/colbert-and-colbert-v2/
Interactive Visualization: Cache Mechanism for Agent RAG Systems

By mcgrof