

You're running gRPC services in Kubernetes, load balancing looks fine on the dashboard — but some pods are burning at 80% CPU while others sit idle, and adding more replicas only partially helps.
Rohit Agrawal, a Staff Software Engineer on the traffic platform team at Databricks, explains why this happens and how his team replaced Kubernetes's default networking with a proxy-less, client-side load-balancing system built on the xDS protocol.
In this episode:
Why kube-proxy's Layer 4 routing breaks down under high-throughput gRPC: it picks a backend once per TCP connection, not per request
How Databricks built an Endpoint Discovery Service (EDS) that watches Kubernetes directly and streams real-time pod metadata to every client
How zone-aware spillover cut cross-availability-zone costs without sacrificing availability
Why CPU-based routing failed (monitoring lag creates oscillation) and what signals to use instead
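The first bullet can be sketched as a quick simulation (hypothetical numbers, not from the episode): with long-lived gRPC connections, connection-level balancing pins every RPC on a connection to one pod, while per-request client-side balancing spreads load evenly.

```python
import random
from collections import Counter

def simulate(clients, backends, rpcs_per_client, per_request):
    """Compare L4 (per-connection) vs L7 (per-request) backend selection.
    kube-proxy picks a backend once when the TCP connection opens; every
    gRPC call multiplexed on that connection then lands on the same pod."""
    load = Counter({b: 0 for b in range(backends)})
    rr = 0
    for _ in range(clients):
        pinned = random.randrange(backends)   # chosen once, at connect time
        for _ in range(rpcs_per_client):
            if per_request:
                load[rr % backends] += 1      # client-side LB: balance each RPC
                rr += 1
            else:
                load[pinned] += 1             # L4: the connection stays pinned
    return sorted(load.values())

random.seed(1)
print(simulate(10, 4, 1000, per_request=False))  # uneven: load follows connection placement
print(simulate(10, 4, 1000, per_request=True))   # [2500, 2500, 2500, 2500]
```

With 10 clients over 4 backends, per-connection placement can never be even (2.5 connections per pod is impossible), which is why adding replicas only partially helps: the skew is structural, not a capacity problem.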
The system has been running in production for three years across hundreds of services, handling millions of requests.
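The zone-aware spillover idea can be sketched roughly as follows. This is an illustrative model, not Databricks' implementation: it assumes each zone generates an equal share of traffic, keeps traffic in-zone while the local zone holds its fair share of endpoints, and spills the remainder to other zones in proportion to their capacity.

```python
from collections import Counter

def routing_split(endpoint_zones, client_zone):
    """Hypothetical zone-aware spillover: return the fraction of this
    client's traffic to route to each zone. If the client's zone holds at
    least its fair share of endpoints, all traffic stays local (no cross-AZ
    cost); otherwise the local zone keeps what its capacity supports and
    the rest spills to remote zones, weighted by their endpoint counts."""
    counts = Counter(endpoint_zones)
    zones = sorted(counts)
    fair_share = 1.0 / len(zones)                            # traffic this zone generates
    local_share = counts[client_zone] / len(endpoint_zones)  # capacity it actually has
    keep_local = min(1.0, local_share / fair_share)
    split = {client_zone: keep_local}
    spill = 1.0 - keep_local
    remote_total = len(endpoint_zones) - counts[client_zone]
    for z in zones:
        if z != client_zone:
            split[z] = spill * counts[z] / remote_total if remote_total else 0.0
    return split

# Balanced zones: every zone keeps its own traffic.
print(routing_split(["a", "a", "b", "b", "c", "c"], "a"))  # {'a': 1.0, 'b': 0.0, 'c': 0.0}
# Zone "a" has 1 of 6 endpoints but generates 1/3 of traffic: half spills out.
print(routing_split(["a", "b", "b", "c", "c", "c"], "a"))
```

The trade-off the episode describes falls out directly: spillover only pays the cross-AZ cost when the local zone is genuinely under-provisioned, so availability is preserved without routing everything everywhere.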
Sponsor
This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/y803JMhBk
Interested in sponsoring an episode? Learn more.
By KubeFM
