AI Papers: A Deep Dive

Why Your Coding Agent Stalls While the GPU Runs Hot


Source: MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

Paper was published on April 14, 2026

This episode was AI-generated on May 3, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Modern LLM serving stacks were built for chatbots, and agents are quietly breaking them — pinning GPUs at full utilization while users wait minutes for replies. A new paper from Duke argues the fix isn't bigger hardware but borrowing scheduling ideas from 1970s operating systems, and the measured speedups are hard to ignore.

Key Takeaways
  • Why throughput dashboards lie for agent workloads, and what 'goodput' — finishing within a multiple of a task's ideal time — actually measures
  • The two pathologies that crater agent latency: KV cache thrashing during tool pauses, and CPU-GPU coupling that strands GPU capacity
  • How MARS unifies scheduling and KV eviction under one priority order using a multi-level feedback queue lifted straight from classical OS design
  • The headline numbers — up to 5.94x mean latency reduction on a controlled testbed, but only ~1.87x in a real OpenHands deployment — and why the gap matters
  • Where the paper's framing is generously tuned: an alpha-of-three success bar, single-GPU experiments, baselines reimplemented inside MARS's stack, and a constructed long-context workload
  • The broader shift the paper represents: LLM serving professionalizing into systems research, with sessions-as-processes and KV-cache-as-virtual-memory as the new vocabulary
    • 00:00 — The busy-GPU, broken-agent puzzle
      Setting up the gap between healthy serving dashboards and unresponsive agents, and why three assumptions baked into chat-era serving no longer hold.
    • 02:59 — Throughput vs. goodput
      Defining the metric the rest of the paper rests on — completion within a scaled time budget — and the chart showing baseline goodput collapsing while throughput stays high.
    • 05:58 — Two pathologies: KV thrashing and CPU-GPU coupling
      Why static keep-or-evict decisions on enormous KV caches fail, and how tool-blocked sessions strand GPU capacity while the CPU is hammered.
    • 08:58 — Inside MARS: observability, admission control, scheduling
      Walking the three-layer architecture, the AIMD admission window, and the multi-level feedback queue that unifies scheduling decisions with KV eviction priority.
    • 11:57 — The chunk-shrinking trick and other small cleverness
      How MARS converts hard preemption failures into graceful slowdowns, plus the modesty of the implementation — about 5,000 lines on top of vLLM.
    • 14:56 — What the numbers actually show
      Separating the controlled-testbed ceiling from the real-deployment gain, and the eviction-rate graph that captures the difference between thrashing and pacing.
    • 17:56 — Where the paper reaches
      Critiquing the alpha-of-three success bar, reimplemented baselines, single-GPU experiments, curated workload, and the regime where MARS's own co-scheduler hurts.
    • 20:55 — Serving as systems research
      Situating MARS within a broader shift toward OS-style framings of LLM inference, and what that means for agent builders and the field's evaluation vocabulary.
Recommended Reading
      • Efficient Memory Management for Large Language Model Serving with PagedAttention — The vLLM paper that MARS builds on top of — essential context for understanding the KV cache block allocator that MARS's eviction policy operates over.
      • Autellix: An Efficient Serving Engine for LLM Agents as General Programs — The program-aware scheduler MARS positions itself against — the episode frames it as 'correct about logical structure, blind to physical resources,' so reading it directly clarifies what MARS adds.
      • MemGPT: Towards LLMs as Operating Systems — A kindred-spirit system in the OS-vocabulary-for-LLMs lineage the episode highlights, treating context management as virtual memory rather than a serving detail.
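
The goodput idea from the takeaways (a task counts as a success only if it finishes within a multiple of its ideal time) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Task` fields and the `goodput` function are hypothetical names, reporting goodput as a fraction of tasks is an assumption, and the default `alpha=3.0` mirrors the alpha-of-three success bar the episode discusses.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical fields, chosen for illustration only.
    ideal_seconds: float     # best-case completion time with no contention
    observed_seconds: float  # measured end-to-end completion time


def goodput(tasks, alpha=3.0):
    """Fraction of tasks that finish within alpha times their ideal time."""
    if not tasks:
        return 0.0
    on_time = sum(
        1 for t in tasks if t.observed_seconds <= alpha * t.ideal_seconds
    )
    return on_time / len(tasks)


# Four made-up tasks: two finish inside the 3x budget, two do not.
tasks = [Task(10, 12), Task(10, 29), Task(10, 45), Task(20, 200)]
print(goodput(tasks))             # 0.5
print(goodput(tasks, alpha=5.0))  # 0.75
```

Loosening `alpha` raises the score without changing anything about the system, which is why the choice of success bar matters when comparing reported goodput numbers.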

AI Papers: A Deep Dive, by paperdive.ai