
Paper Link: https://arxiv.org/abs/2603.16104
Summary:
This paper introduces Helium, a workflow-aware serving framework designed to optimize agentic workflows, which are sequences of interdependent Large Language Model (LLM) calls. The authors argue that existing serving systems are inefficient because they optimize individual inference tasks in isolation and treat LLM calls as black-box functions, failing to capture the massive redundancy and cross-call dependencies inherent in multi-step agentic workloads.
To address these inefficiencies, Helium applies classic data system principles to LLM serving through several key innovations:
• Query Plan Modeling: It represents agentic workflows as directed acyclic graphs (DAGs) where LLM calls are treated as first-class operators, allowing for global optimizations like plan pruning and common subgraph elimination.
• Proactive Caching: Unlike traditional "passive" caching, Helium identifies static prompt prefixes during compilation to pre-warm KV caches and uses a global prompt cache to bypass redundant operator executions entirely.
• Cache-Aware Scheduling: It employs a novel Templated Radix Tree (TRT) to model the global prefix hierarchy and dependencies, paired with a cost-based algorithm that schedules tasks to maximize KV cache reuse across multiple workers.
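To make the common subgraph elimination idea concrete, here is a minimal sketch in Python. It is not Helium's implementation: the `Op` class, the `eliminate_common_subgraphs` function, and the example operator names are all hypothetical, and the sketch assumes that operators arrive in topological order and that two calls with the same prompt and the same inputs can safely share one result.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical operator node: one LLM call in a workflow DAG."""
    name: str
    prompt: str                    # fully specified prompt text or template
    inputs: Tuple[str, ...] = ()   # names of upstream operators


def eliminate_common_subgraphs(ops: List[Op]) -> Dict[str, str]:
    """Map every operator to a canonical operator that computes the same result.

    Assumption: `ops` is in topological order, so each input has already
    been assigned a canonical name when its consumer is visited.
    """
    canonical: Dict[str, str] = {}
    seen: Dict[Tuple[str, Tuple[str, ...]], str] = {}
    for op in ops:
        # Two operators are equivalent if they share a prompt and their
        # inputs resolve to the same canonical upstream operators.
        key = (op.prompt, tuple(canonical[i] for i in op.inputs))
        canonical[op.name] = seen.setdefault(key, op.name)
    return canonical


ops = [
    Op("summarize_a", "Summarize the quarterly report."),
    Op("summarize_b", "Summarize the quarterly report."),
    Op("critique_a", "Critique this summary.", ("summarize_a",)),
    Op("critique_b", "Critique this summary.", ("summarize_b",)),
    Op("translate", "Translate this summary.", ("summarize_a",)),
]

mapping = eliminate_common_subgraphs(ops)
unique_ops = sorted(set(mapping.values()))
print(mapping)
print(unique_ops)
```

In this toy plan, five declared operators collapse to three that actually need to run, because the duplicate summarize and critique branches resolve to the same canonical nodes.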
Evaluation results show that Helium achieves up to a 1.56× speedup over state-of-the-art agent serving systems and up to a 39.5× reduction in latency compared to naive sequential execution, while strictly preserving semantic accuracy.
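The intuition behind cache-aware scheduling can be illustrated with a much simpler greedy heuristic than the one the paper describes. The sketch below is an assumption-laden stand-in, not the Templated Radix Tree or Helium's cost-based algorithm: the `assign` function, the `load_penalty` parameter, and the use of shared characters as a proxy for reusable KV-cache tokens are all hypothetical choices made for illustration.

```python
import os
from typing import List


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix, in characters (a rough proxy for tokens)."""
    return len(os.path.commonprefix([a, b]))


def assign(prompts: List[str], num_workers: int, load_penalty: int = 1) -> List[int]:
    """Greedily send each prompt to the worker with the best reuse-minus-load score."""
    workers: List[List[str]] = [[] for _ in range(num_workers)]
    assignment: List[int] = []
    for prompt in prompts:
        def score(w: int) -> int:
            # Reuse: longest prefix shared with any prompt already on worker w.
            reuse = max((shared_prefix_len(prompt, q) for q in workers[w]), default=0)
            # Load: penalize workers that already hold many prompts.
            return reuse - load_penalty * len(workers[w])

        best = max(range(num_workers), key=score)  # ties go to the lowest index
        workers[best].append(prompt)
        assignment.append(best)
    return assignment


prompts = [
    "Reviewer: review A",
    "Planner: plan X",
    "Reviewer: review B",
    "Planner: plan Y",
]
result = assign(prompts, num_workers=2)
print(result)
```

With these four prompts and two workers, the reviewer prompts end up on one worker and the planner prompts on the other, so each worker can reuse the prefix it has already processed.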
By Yun Wu