AI Post Transformers

Cortex: Semantic Knowledge Caching for Low-Latency LLM Agents



The February 3, 2026 research paper, a collaboration between the National University of Singapore, USTC, the University of Toronto, and the Sea AI Lab, introduces Cortex, a specialized caching system that addresses the high latency and financial cost of LLM agent applications. Unlike standard models, agents frequently perform repetitive external data retrievals, which cause significant delays and expensive API fees. Cortex optimizes this process with a semantic cache that stores and reuses prior search results and tool calls. Its key innovation is a semantic judge, which verifies that cached data is still accurate and relevant before it is served to the user. To maximize hardware efficiency, the system co-locates the agent and the judge on a single GPU using priority-aware scheduling. Evaluations show throughput improvements of up to 3.6× alongside operational cost reductions of over 90%.

Source: Cortex: Achieving Low-Latency, Cost-Efficient Remote Data Access For LLM via Semantic-Aware Knowledge Caching
Chaoyi Ruan, Chao Bi, Kaiwen Zheng, Ziji Shi, Xinyi Wan, Jialin Li
National University of Singapore, USTC, University of Toronto, Sea AI Lab
February 3, 2026
https://arxiv.org/pdf/2509.17360
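For intuition, here is how the cache-then-judge flow can be pictured in code. This is a minimal sketch under stated assumptions, not the paper's implementation: `SemanticCache`, `embed_fn`, `judge_fn`, the 0.7 threshold, and the toy bag-of-words embedder are all illustrative placeholders (in Cortex the judge is a model co-located with the agent on the GPU, not a Python callable).

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class CacheEntry:
    query: str             # original query text
    embedding: np.ndarray  # unit-norm embedding of the query
    result: str            # cached retrieval / tool-call result


class SemanticCache:
    """Embedding-similarity cache for agent retrievals, gated by a judge."""

    def __init__(self, embed_fn, judge_fn, threshold=0.7):
        self.embed_fn = embed_fn    # text -> unit-norm vector (placeholder)
        self.judge_fn = judge_fn    # (query, entry) -> bool   (placeholder)
        self.threshold = threshold  # cosine-similarity cutoff (placeholder)
        self.entries = []

    def lookup(self, query):
        """Return a cached result only if it is similar AND judge-approved."""
        if not self.entries:
            return None
        q = self.embed_fn(query)
        # Cosine similarity reduces to a dot product for unit-norm vectors.
        sims = [float(q @ e.embedding) for e in self.entries]
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            entry = self.entries[best]
            # Semantic judge: check the cached result still answers this query.
            if self.judge_fn(query, entry):
                return entry.result
        return None  # miss at either stage: caller does the real remote fetch

    def insert(self, query, result):
        self.entries.append(CacheEntry(query, self.embed_fn(query), result))


# Toy usage with a stand-in bag-of-words embedder and an always-approve judge.
def toy_embed(text):
    v = np.zeros(256)
    for tok in text.lower().split():
        v[hash(tok) % 256] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


cache = SemanticCache(toy_embed, judge_fn=lambda q, e: True, threshold=0.7)
cache.insert("weather in Singapore today", "31C, thunderstorms")
print(cache.lookup("today's weather in Singapore"))  # likely a hit (cos ~0.75)
print(cache.lookup("NVIDIA stock price"))            # unrelated: None
```

The two-stage design keeps the common path cheap: vector similarity filters candidates, and only near-matches pay for the judge's semantic validity check, so a miss at either stage falls through to the real remote retrieval.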

By mcgrof