AI: post transformers

DroidSpeak: Cross-LLM KV Cache Sharing


This episode covers DroidSpeak, a distributed Large Language Model (LLM) inference system designed to improve the efficiency of compound AI systems. It tackles the problem of reusing Key-Value (KV) caches across different LLMs that share the same architectural foundation, something current serving systems do not support. DroidSpeak achieves significant throughput improvements and faster prefill by recomputing only a small, "critical" subset of KV cache layers, identified in an offline profiling stage, while reusing the rest, with negligible impact on output quality. Pipelining this selective recomputation with KV cache loading makes the approach practical for multi-LLM workflows in distributed settings. The paper demonstrates DroidSpeak's robustness and benefits across a range of tasks and model pairs.
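The core idea can be sketched in a few lines. The snippet below is an illustrative model only, not DroidSpeak's implementation: the function names, the per-layer cost numbers, and the specific critical-layer indices are all hypothetical, standing in for what the paper obtains from its offline profiling stage.

```python
# Illustrative sketch of DroidSpeak-style selective KV cache reuse.
# All names and numbers here are hypothetical; the real system
# identifies "critical" layers via offline profiling.

def plan_kv_reuse(num_layers, critical_layers):
    """Per-layer plan: 'recompute' the quality-critical layers,
    'reuse' (load the sender LLM's cached KV) for the rest."""
    critical = set(critical_layers)
    return ["recompute" if i in critical else "reuse"
            for i in range(num_layers)]

def pipelined_prefill_time(plan, load_ms_per_layer, recompute_ms_per_layer):
    """With recomputation pipelined against KV cache loading, the
    critical path is roughly the max of the two streams, not their sum."""
    load_ms = plan.count("reuse") * load_ms_per_layer
    recompute_ms = plan.count("recompute") * recompute_ms_per_layer
    return max(load_ms, recompute_ms)

# Example: a 12-layer model where (hypothetically) profiling flagged
# layers 0-2 as critical, so only 3 of 12 layers are recomputed.
plan = plan_kv_reuse(12, critical_layers=[0, 1, 2])
```

Because most layers are reused rather than recomputed, the recompute stream stays short and largely hides behind the cache-loading stream, which is where the prefill speedup comes from.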


Source: Published July 2025

https://arxiv.org/pdf/2411.02820v4


By mcgrof