May 03, 2026

Why Your Coding Agent Stalls While the GPU Runs Hot

23 minutes

Source: MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

Paper was published on April 14, 2026

This episode was AI-generated on May 3, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Modern LLM serving stacks were built for chatbots, and agents are quietly breaking them: GPUs sit pinned at full utilization while users wait minutes for replies. A new paper from Duke argues the fix isn't bigger hardware but scheduling ideas borrowed from 1970s operating systems, and the measured speedups are hard to ignore.

Key Takeaways

Why throughput dashboards lie for agent workloads, and what 'goodput' (finishing within a fixed multiple of a task's ideal time) actually measures

The two pathologies that crater agent latency: KV cache thrashing during tool pauses, and CPU-GPU coupling that strands GPU capacity

How MARS unifies scheduling and KV eviction under one priority order using a multi-level feedback queue lifted straight from classical OS design

The headline numbers (up to 5.94x mean latency reduction on a controlled testbed, but only ~1.87x in a real OpenHands deployment) and why the gap between them matters

Where the paper's framing is generously tuned: a success bar of alpha = 3 (a task counts as on time if it finishes within three times its ideal duration), single-GPU experiments, baselines reimplemented inside MARS's stack, and a constructed long-context workload

The broader shift the paper represents: LLM serving professionalizing into systems research, with sessions-as-processes and KV-cache-as-virtual-memory as the new vocabulary
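
For readers who want the goodput idea in concrete form, here is a minimal Python sketch. It assumes goodput is reported as the fraction of tasks that finish within alpha times their ideal time, with alpha = 3 as discussed in the episode. The `Task` fields, the function name, and the sample timings are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical fields: the paper's exact definition of "ideal time"
    # is not reproduced in these notes.
    ideal_s: float   # best-case completion time with no contention
    actual_s: float  # observed end-to-end completion time


def goodput(tasks, alpha=3.0):
    """Fraction of tasks that finish within alpha * ideal time."""
    if not tasks:
        return 0.0
    on_time = sum(1 for t in tasks if t.actual_s <= alpha * t.ideal_s)
    return on_time / len(tasks)


# Made-up timings: every task completes, but only two meet the 3x budget.
tasks = [Task(10, 12), Task(10, 29), Task(10, 45), Task(20, 61)]
print(goodput(tasks))             # 0.5
print(goodput(tasks, alpha=5.0))  # 1.0
```

All four sample tasks complete, so a throughput counter would look healthy, yet half of them miss the budget. Loosening alpha from 3 to 5 makes every task count as a success, which is why the choice of alpha deserves scrutiny.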

00:00 — The busy-GPU, broken-agent puzzle
Setting up the gap between healthy serving dashboards and unresponsive agents, and why three assumptions baked into chat-era serving no longer hold.

02:59 — Throughput vs. goodput
Defining the metric the rest of the paper rests on — completion within a scaled time budget — and the chart showing baseline goodput collapsing while throughput stays high.

05:58 — Two pathologies: KV thrashing and CPU-GPU coupling
Why static keep-or-evict decisions on enormous KV caches fail, and how tool-blocked sessions strand GPU capacity while the CPU is hammered.

08:58 — Inside MARS: observability, admission control, scheduling
Walking the three-layer architecture, the AIMD admission window, and the multi-level feedback queue that unifies scheduling decisions with KV eviction priority.

11:57 — The chunk-shrinking trick and other small cleverness
How MARS converts hard preemption failures into graceful slowdowns, plus how modest the implementation is: about 5,000 lines on top of vLLM.

14:56 — What the numbers actually show
Separating the controlled-testbed ceiling from the real-deployment gain, and the eviction-rate graph that captures the difference between thrashing and pacing.

17:56 — Where the paper reaches
Critiquing the alpha-of-three success bar, reimplemented baselines, single-GPU experiments, curated workload, and the regime where MARS's own co-scheduler hurts.

20:55 — Serving as systems research
Situating MARS within a broader shift toward OS-style framings of LLM inference, and what that means for agent builders and the field's evaluation vocabulary.
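
The AIMD admission window mentioned in the 08:58 chapter follows the additive-increase, multiplicative-decrease pattern familiar from TCP congestion control. The sketch below shows only that generic pattern. The class name, the constants, and the boolean `overloaded` signal are assumptions for illustration; these notes do not say which signal or parameters MARS actually uses.

```python
class AimdWindow:
    """Generic AIMD controller for the number of admitted sessions."""

    def __init__(self, initial=4, min_size=1, max_size=64, step=1, backoff=0.5):
        # All defaults are illustrative, not values from the paper.
        self.size = initial
        self.min_size = min_size
        self.max_size = max_size
        self.step = step
        self.backoff = backoff

    def update(self, overloaded):
        """Adjust the window once per control interval and return it."""
        if overloaded:
            # Multiplicative decrease: back off quickly under pressure.
            self.size = max(self.min_size, int(self.size * self.backoff))
        else:
            # Additive increase: probe for spare capacity slowly.
            self.size = min(self.max_size, self.size + self.step)
        return self.size


window = AimdWindow(initial=4)
# Hypothetical overload readings, e.g. derived from KV eviction pressure.
history = [window.update(flag) for flag in (False, False, True, True, True)]
print(history)  # [5, 6, 3, 1, 1]
```

The asymmetry is the point of the design: the window grows by one session at a time while the system is healthy, halves as soon as pressure appears, and never drops below the configured minimum.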