The Sam Ellis Show

The Agent Needs a Longer Memory



For most of the AI boom, inference meant a person asking a model a question and waiting for an answer. This episode looks at the shift Ben Thompson calls “agentic inference”: systems doing long-running work, where the bottleneck is not only response speed but persistent context, state, and memory.

Sam Ellis reports on why agent memory is becoming infrastructure. MinIO’s MemKV announcement frames context loss as a “recompute tax,” with GPUs repeating work they already did. NVIDIA’s Dynamo and BlueField-4 context-memory material describes the same pressure around KV cache: prompt context grows, GPU memory is scarce, and systems have to choose between recomputation, smaller context windows, or more hardware. OpenAI’s Codex mobile rollout and Agents SDK point to the operator-facing side of the same story: long-running agent work needs live state, approvals, filesystem tools, sandboxing, and resumable execution.
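To make the "recompute tax" concrete, here is a toy sketch, not a model of MemKV, Dynamo, or any real system. It assumes two hypothetical agent sessions that take turns, each with a context that grows by 1,000 tokens per turn, sharing a context store with a fixed token budget and least-recently-used eviction. The class name, the function name, and all the numbers are invented for illustration. The only thing it counts is how many context tokens have to be processed again because their cached state was evicted.

```python
from collections import OrderedDict


class ToyContextCache:
    """Toy LRU store: session id -> number of context tokens with cached state.

    Hypothetical illustration only; real KV caches work on blocks of
    attention keys/values, not on simple token counts.
    """

    def __init__(self, capacity_tokens):
        self.capacity_tokens = capacity_tokens
        self.entries = OrderedDict()  # oldest session first
        self.recomputed_tokens = 0
        self.reused_tokens = 0

    def step(self, session, context_tokens):
        # Whatever is still cached for this session can be reused.
        cached = self.entries.pop(session, 0)
        reused = min(cached, context_tokens)
        self.reused_tokens += reused
        # Everything else must be processed (again) before the agent continues.
        self.recomputed_tokens += context_tokens - reused

        # Store the full context, then evict least-recently-used sessions
        # until the store fits its token budget.
        self.entries[session] = context_tokens
        while self.entries and sum(self.entries.values()) > self.capacity_tokens:
            self.entries.popitem(last=False)


def simulate(capacity_tokens, turns=4, sessions=("agent-a", "agent-b"), growth=1000):
    """Return total tokens processed when sessions alternate and contexts grow."""
    cache = ToyContextCache(capacity_tokens)
    for turn in range(1, turns + 1):
        for session in sessions:
            cache.step(session, turn * growth)
    return cache.recomputed_tokens


if __name__ == "__main__":
    print("ample memory:", simulate(100_000))  # only new tokens are processed
    print("tight memory:", simulate(4_000))    # evictions force repeated work
```

Under these assumptions the tight store ends up processing twice as many tokens as the ample one (16,000 versus 8,000), because each eviction forces a session to rebuild its whole context on its next turn. That is the trade the sources describe: pay in recomputation, shrink the context, or add memory.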

The through-line is simple: if agents become workers, memory becomes workplace infrastructure — something companies have to buy, secure, meter, audit, and explain.

Sources

  • Ben Thompson, Stratechery: “The Inference Shift”
  • MinIO: “MinIO Announces MemKV, Purpose-Built Context Memory Store for AI Inference”
  • NVIDIA Developer Blog: “How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo”
  • NVIDIA Developer Blog: “Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI”
  • OpenAI: “Introducing Codex”
  • Pulse 2.0: “OpenAI: Codex Expands To Mobile App, Bringing AI Coding Workflows To Phones”
  • OpenAI Agents SDK documentation

The Sam Ellis Show, by Sam Ellis