KubeFM

GPU Containers as a Service, with Landon Clipp



Running GPU workloads on Kubernetes sounds straightforward until you need to isolate multiple tenants on the same server. The moment you virtualize GPUs for security, the host loses access to the NVIDIA kernel drivers, and almost every tool in the ecosystem assumes those drivers exist.

Landon Clipp built a GPU-based Containers as a Service platform from scratch, solving each isolation layer — from kernel separation with Kata Containers + QEMU to NVLink fabric partitioning to network policies with Cilium/eBPF — and shares exactly what broke along the way.

In this interview:

  • Why standard NVIDIA tooling (GPU Operator) fails in multi-tenant setups, and how to use CDI with PCI topology scanning to make GPUs visible to Kubernetes without kernel drivers

  • How to partition the NVLink fabric between tenants using a trusted service VM running Fabric Manager, and why the physical PCIe wiring differs between Supermicro HGX and NVIDIA DGX systems

  • Why gVisor doesn't work for GPU workloads — NVIDIA's unstable ioctl ABI means Google has to update gVisor for every driver release, and they only support a handful of GPUs

  • What caused 8-GPU VMs to take 30+ minutes to boot, and the specific fixes (IOMMUFD, cold plugging, kernel upgrades) that brought it down to minutes

  • How Cilium network policies enforce tenant isolation at the Kubernetes identity level instead of fragile IP-based rules
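
To make the first point concrete, here is a minimal sketch of what "CDI with PCI topology scanning" could look like. It is not the implementation discussed in the episode. It assumes the GPUs are handed to guests through VFIO, walks the PCI device tree in sysfs looking for NVIDIA's vendor ID (0x10de), and emits a CDI spec that exposes each GPU's IOMMU group device node. The kind `example.com/gpu`, the `gpuN` device names, and the helper function names are hypothetical; a real setup would also need to confirm each device is bound to `vfio-pci` and would likely expose `/dev/vfio/vfio` as well.

```python
import json
import os

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA
# PCI class prefixes: 0x0300 = VGA controller, 0x0302 = 3D controller
GPU_CLASS_PREFIXES = ("0x0300", "0x0302")


def read_attr(path):
    """Read a single-line sysfs attribute and strip the trailing newline."""
    with open(path) as f:
        return f.read().strip()


def scan_gpus(pci_root="/sys/bus/pci/devices"):
    """Return [(pci_address, iommu_group)] for NVIDIA GPUs under pci_root.

    No NVIDIA kernel driver is involved: only generic sysfs attributes
    (vendor, class, iommu_group) are read.
    """
    if not os.path.isdir(pci_root):
        return []
    gpus = []
    for addr in sorted(os.listdir(pci_root)):
        dev = os.path.join(pci_root, addr)
        try:
            vendor = read_attr(os.path.join(dev, "vendor"))
            pci_class = read_attr(os.path.join(dev, "class"))
        except OSError:
            continue  # not a readable PCI device entry
        if vendor != NVIDIA_VENDOR_ID:
            continue
        if not pci_class.startswith(GPU_CLASS_PREFIXES):
            continue  # e.g. an NVIDIA audio function or bridge
        group_link = os.path.join(dev, "iommu_group")
        if not os.path.islink(group_link):
            continue  # no IOMMU group, so VFIO passthrough is not possible
        group = os.path.basename(os.readlink(group_link))
        gpus.append((addr, group))
    return gpus


def build_cdi_spec(gpus, kind="example.com/gpu"):
    """Build a CDI spec dict; `kind` and the device names are placeholders."""
    devices = []
    for index, (_addr, group) in enumerate(gpus):
        devices.append({
            "name": "gpu%d" % index,
            "containerEdits": {
                # The VFIO group node is what a VM-based runtime would open.
                "deviceNodes": [{"path": "/dev/vfio/" + group}],
            },
        })
    return {"cdiVersion": "0.6.0", "kind": kind, "devices": devices}


if __name__ == "__main__":
    print(json.dumps(build_cdi_spec(scan_gpus()), indent=2))
```

A runtime with CDI enabled could then resolve a request such as `example.com/gpu=gpu0` against the generated file. How the spec is registered with the kubelet, and how the runtime maps the VFIO group into the guest, depends on the platform and is not shown here.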

Where Containers as a Service fits best: inference workloads where AI teams want to ship an OCI image without managing infrastructure or signing multi-million dollar cluster contracts.

Sponsor

This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info

  • Find all the links and info for this episode here: https://ku.bz/jjK_yJTDz

  • Interested in sponsoring an episode? Learn more.
