
This research introduces "sleep-time compute," a novel method for enhancing large language model efficiency by allowing them to process contextual information offline, before user queries arrive. By anticipating potential questions and pre-computing relevant inferences, this approach significantly reduces the computational resources and latency needed at test time to achieve comparable or even better accuracy on reasoning tasks. The study demonstrates that sleep-time compute can lead to substantial savings in test-time compute and can be further amplified by scaling the offline processing or by applying it to multiple related queries sharing the same context. Moreover, the effectiveness of sleep-time compute is strongly correlated with how predictable the user's query is based on the available context, suggesting strategic application for maximum benefit.
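The two-phase mechanism described above can be sketched as a toy agent: an offline "sleep" phase does the expensive context processing once, and the test-time phase serves queries from that pre-computed cache, amortizing the cost across queries that share the context. This is an illustrative sketch only; the class, function names, and the stand-in "inference" are hypothetical, not from the paper.

```python
from dataclasses import dataclass, field

def expensive_inference(context: str) -> dict[str, str]:
    """Stand-in for costly offline LLM reasoning over the context.
    Here it just pre-computes a few trivial 'derived facts'."""
    words = context.split()
    return {
        "word_count": str(len(words)),
        "first_word": words[0] if words else "",
        "last_word": words[-1] if words else "",
    }

@dataclass
class SleepTimeAgent:
    context: str
    _cache: dict[str, str] = field(default_factory=dict)

    def sleep(self) -> None:
        # Offline phase: anticipate likely queries and pre-compute
        # inferences before any user query arrives.
        self._cache = expensive_inference(self.context)

    def answer(self, query: str) -> str:
        # Test-time phase: serve from the pre-computed cache when possible,
        # falling back to on-demand (expensive) reasoning otherwise.
        if query in self._cache:
            return self._cache[query]
        return expensive_inference(self.context).get(query, "unknown")

agent = SleepTimeAgent("sleep time compute beyond inference scaling at test time")
agent.sleep()                 # pay the compute cost once, offline
print(agent.answer("word_count"))
```

Multiple queries against the same `agent` reuse the single offline pass, which mirrors the paper's observation that amortizing sleep-time compute across related queries sharing a context amplifies the savings.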