Best AI papers explained

PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications

The research introduces PrefillOnly, a novel inference engine designed specifically for Large Language Models (LLMs) used in discriminative tasks, where only a single output token is generated. Unlike traditional LLM engines, which are optimized for variable-length outputs, PrefillOnly significantly reduces GPU memory consumption by storing only the Key-Value (KV) cache of the last computed layer and by using hybrid prefilling to keep intermediate tensor sizes manageable. Because single-token jobs have predictable runtimes, its Job Completion Time (JCT)-aware scheduler can continuously calibrate its estimates against observed prefix cache hits, improving throughput and reducing latency over existing solutions on these workloads. This approach paves the way for more efficient deployment of LLMs in applications such as recommendation and credit verification.
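The memory idea can be made concrete with a toy sketch. The Python loop below (hypothetical `toy_layer` and weight shapes, not the paper's implementation) illustrates why at most one layer's KV cache needs to stay resident when exactly one output token is produced: each layer's keys and values are consumed by that layer's own attention during prefill and are never reused by a later decode step.

```python
import torch

def toy_layer(hidden, w_kv, w_o):
    """A stand-in transformer layer: project to packed K/V, self-attend,
    project back. (Illustrative only: real layers also have Q projections,
    multiple heads, MLPs, and residual connections.)"""
    kv = hidden @ w_kv                       # [seq, 2*d] packed keys/values
    k, v = kv.chunk(2, dim=-1)               # [seq, d] each
    attn = torch.softmax(hidden @ k.T / k.shape[-1] ** 0.5, dim=-1) @ v
    return attn @ w_o, kv

def prefill_only(hidden, weights):
    """Prefill a single-output request while keeping at most one layer's
    KV cache alive at a time."""
    kv = None
    for w_kv, w_o in weights:
        hidden, kv = toy_layer(hidden, w_kv, w_o)
        # Rebinding `kv` drops the previous layer's cache, so only the most
        # recently computed layer's KV ever occupies GPU memory.
    return hidden, kv                        # only the last layer's KV survives

d, seq, n_layers = 64, 128, 4
weights = [(torch.randn(d, 2 * d), torch.randn(d, d)) for _ in range(n_layers)]
final_hidden, last_kv = prefill_only(torch.randn(seq, d), weights)
```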
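Similarly, here is a minimal sketch of what JCT-aware scheduling could look like, assuming a hypothetical `JCTAwareScheduler` interface: because each job emits one token, its runtime is roughly proportional to the uncached portion of its prompt, so the scheduler can run shortest-estimated-JCT-first and recalibrate its per-token cost as prefix-cache hits are observed. The actual engine's scheduler is more involved.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    est_jct: float                          # predicted completion time (seconds)
    prompt_len: int = field(compare=False)
    cached_len: int = field(compare=False)  # tokens already in the prefix cache

class JCTAwareScheduler:
    def __init__(self, secs_per_token=1e-4):
        self.secs_per_token = secs_per_token  # calibrated online
        self.queue = []

    def submit(self, prompt_len, cached_len):
        # One output token means runtime is dominated by prefilling the
        # uncached prompt suffix, so JCT is predictable at submission time.
        jct = (prompt_len - cached_len) * self.secs_per_token
        heapq.heappush(self.queue, Request(jct, prompt_len, cached_len))

    def next_request(self):
        # Shortest-estimated-JCT-first minimizes average completion time.
        return heapq.heappop(self.queue) if self.queue else None

    def record_completion(self, req, measured_secs):
        # Continuous calibration: blend the observed per-token cost back in
        # so future estimates track actual prefix-cache behavior.
        uncached = max(req.prompt_len - req.cached_len, 1)
        self.secs_per_token = (0.9 * self.secs_per_token
                               + 0.1 * measured_secs / uncached)
```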


By Enoch H. Kang