High Output: The Future of Engineering

Stop starving your GPUs—with Jaikumar Ganesh



When you provision thousands of GPU clusters weekly for Apple, Spotify, OpenAI, Uber, Runway, and Cursor, you see something nobody else does. From the infrastructure layer, the patterns are unmistakable. Different companies, different products, different use cases—but they’re all hitting the same bottlenecks.

Jaikumar Ganesh would know. As Head of Engineering at Anyscale, he leads the team behind Ray, the distributed compute engine that powers production AI at scale. Ray sits in the stack between Kubernetes and the AI workloads, orchestrating the compute that makes everything run. When you’re that deep in everyone’s infrastructure, you see the convergence before anyone else does.

And what he sees right now? Companies are throwing money at GPU shortages while their GPUs sit idle half the time, waiting for CPUs to finish resizing images. It’s not a GPU problem. It’s a coordination problem. And it’s just one of several patterns everyone’s hitting—patterns most don’t even realize are shared.

🎧 Subscribe and listen now →

The bottlenecks everyone’s hitting

Here’s what actually happens in production. You need to process multimodal data—audio, video, robotics sensors, Zoom recordings. Reading images and resizing them? CPUs. LLM inference? GPUs. Writing results back? CPUs again. These are staged pipelines.

“In a lot of legacy systems, you read an image and you have to wait till all the images are read till you activate the GPU,” JK explains. “Now what happens is there’s a GPU shortage, but you have a GPU sitting there idle, and so your GPU utilization is low and your finance team is like, ‘Hey, you’re spending so much money.’”

This is the shift that sounds obvious but isn’t: we’ve moved from a CPU-centric world to a heterogeneous compute world of CPU plus GPU. Most frameworks were built for one or the other; very few handle the handoff well. Ray Data’s answer is to keep the transitions in memory instead of writing to disk at every stage, so different pipeline stages execute on the right resource and nothing sits waiting.
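
To make that concrete, here’s a rough sketch of what such a pipeline looks like with Ray Data. It isn’t taken from the episode; the bucket paths, batch size, worker counts, and the stand-in classifier are all hypothetical. The point is that CPU preprocessing, GPU inference, and CPU writes run as stages of one streaming job rather than as separate passes over disk.

```python
import numpy as np
import ray

ray.init()

# CPU stage: read images lazily (the path is hypothetical).
ds = ray.data.read_images("s3://my-bucket/raw-frames/")

def preprocess(row):
    # Stand-in for real decode/resize work; runs on CPU workers.
    row["image"] = (row["image"] / 255.0).astype(np.float32)
    return row

ds = ds.map(preprocess)

# GPU stage: batched inference with a stateful worker that loads the model once.
class Classifier:
    def __init__(self):
        # Hypothetical model; swap in a real checkpoint load here.
        self.model = lambda images: np.zeros(len(images))

    def __call__(self, batch):
        batch["label"] = self.model(batch["image"])
        return batch

ds = ds.map_batches(
    Classifier,
    batch_size=64,
    num_gpus=1,      # each inference worker claims a GPU (illustrative)
    concurrency=2,   # GPU workers pull batches as CPU workers produce them
)

# CPU stage again: write results, with no intermediate dump to disk in between.
ds.write_parquet("s3://my-bucket/labels/")
```

Nothing here is specific to images; the same shape applies to audio, video, or sensor data, which is exactly where the idle-GPU problem shows up first.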

The companies that figure this out have massive cost advantages. The ones that don’t keep throwing money at GPU clusters that spend half their time idle.

But here’s what’s remarkable: when you’re provisioning clusters at this scale, you see more than just the GPU coordination problem. You see the entire stack converging. Pull up any major AI company’s infrastructure and you’ll see the same architecture:

* At the top: AI workloads (data processing, pre-training, post-training, model serving)

* Below that: Training frameworks (PyTorch, JAX)

* Then: LLM-specific engines (vLLM for serving, DeepSpeed and FSDP for parallelism)

* Distributed compute: Ray

* Container orchestration: Kubernetes

* At the bottom: Cloud providers and GPU providers

“Across all the companies we have worked with, in open source as well as those who are Anyscale customers, this pattern is consistent,” JK explains.

Here are the four patterns driving convergence:

* Heterogeneous compute coordination: CPU-centric thinking doesn’t work anymore. You need CPU and GPU working together efficiently. Most frameworks handle one or the other well, but the handoff between them is where money gets burned. Multimodal data processing—audio, video, sensor data—exposes this immediately (see the sketch after this list).

* Post-training infrastructure complexity: Everyone thinks pre-training is the hard part. Wrong. Post-training is where the real infrastructure complexity lives, and it’s where customization happens. Eight of the ten most popular open source post-training libraries are built on Ray. Why? Because you need inference stages mixed with training stages, all within the same workload. Someone has to orchestrate where each stage runs, whether to transfer model weights, how to handle the compute efficiently.

* Multimodal data pipeline bottlenecks: It’s not a model problem—it’s an engineering problem. The bottleneck isn’t which model handles video best. It’s moving data between CPUs and GPUs efficiently without writing to disk at every stage. Fix the pipeline, not the model selection.

* Domain-specific approaches returning: While everyone obsessed over LLMs, reinforcement learning quietly came back in gaming and simulation. Riot Games—one of Ray’s largest customers—uses RL to power the models behind their characters. When you have a physical world or game environment to model, RL still wins. Different problems need different approaches. They all need the same underlying infrastructure to scale.
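
For the first pattern in this list, here’s a minimal lower-level sketch, not from the episode, of how heterogeneous scheduling looks when you declare resources per task in Ray; the function names, file names, and resource counts are illustrative.

```python
import ray

ray.init()

@ray.remote(num_cpus=1)
def preprocess(path: str) -> str:
    # Stand-in for CPU-bound decode/resize work.
    return path.lower()

@ray.remote(num_gpus=1)  # illustrative; on a CPU-only machine this task would queue
def infer(*records: str) -> list:
    # Stand-in for batched GPU inference.
    return [f"label({r})" for r in records]

label_refs = []
for batch in [["A.jpg", "B.jpg"], ["C.jpg", "D.jpg"]]:
    cleaned = [preprocess.remote(p) for p in batch]
    # Each inference call starts as soon as its own batch's CPU work finishes,
    # rather than after every image in the dataset has been read.
    label_refs.append(infer.remote(*cleaned))

print(ray.get(label_refs))
```

Ray Data’s streaming APIs (sketched earlier) package this pattern up, but the scheduling idea is the same: declare what each stage needs and let the scheduler overlap the work.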

The interesting part isn’t that everyone uses similar tools. It’s that the bottlenecks are identical. They’re all hitting the same walls—and most of them think they’re the only ones.

Where the real moat lives

Pre-training gets the headlines and the hype. Post-training is where the actual differentiation happens.

Think about it: pre-training is increasingly commoditized. You can use foundation models from OpenAI, Anthropic, or Meta. But post-training—fine-tuning models for your specific use case, your specific data, your specific product needs—that’s where you build something defensible.

And post-training infrastructure is brutally complex. You need inference stages mixed with training stages. You’re constantly moving between different compute resources. You’re orchestrating model weight transfers. You’re debugging why your pipeline breaks at 2am.

This is why eight of the ten most popular open source post-training libraries are built on Ray. Anthropic’s Claude uses them. Cursor’s agents use them. Not because Ray is magic, but because orchestrating this complexity requires infrastructure built specifically for heterogeneous compute.

“They all use post-training libraries, and someone has to orchestrate and handle compute efficiently,” JK explains. “You can have inference stage, your training stage within the post-training libraries itself, and there’ll be a lot of complexity around where each one of these stages runs.”
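
To illustrate what mixing an inference stage and a training stage in one workload can look like, here’s a toy sketch built on Ray actors. It isn’t how any particular post-training library is implemented; the actor names, dummy weights, and GPU counts are hypothetical, and a real library would put an actual serving engine and optimizer step behind these interfaces.

```python
import numpy as np
import ray

ray.init()

@ray.remote(num_gpus=1)  # illustrative; as written this needs two GPUs total
class Sampler:
    """Inference stage: generates completions with the current policy weights."""
    def __init__(self):
        self.weights = None

    def set_weights(self, weights):
        self.weights = weights  # a real pipeline would load these into the model

    def generate(self, prompts):
        return [f"completion for: {p}" for p in prompts]  # stand-in for sampling

@ray.remote(num_gpus=1)
class Trainer:
    """Training stage: updates weights and hands them back for the next round."""
    def __init__(self):
        self.weights = {"layer0": np.zeros(4)}

    def train_step(self, completions):
        self.weights["layer0"] += 0.01  # stand-in for a real gradient step
        return len(completions)

    def get_weights(self):
        return self.weights

sampler, trainer = Sampler.remote(), Trainer.remote()

for step in range(3):
    # Weight transfer: the object ref flows actor-to-actor without touching the driver.
    ray.get(sampler.set_weights.remote(trainer.get_weights.remote()))
    rollouts = sampler.generate.remote(["prompt A", "prompt B"])
    ray.get(trainer.train_step.remote(rollouts))
```

A production post-training library layers real models, weight-synchronization strategies, and failure handling on top of this kind of orchestration, which is where the complexity JK describes comes from.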

Your competitors are using the same foundation models. They’re reading the same papers. The differentiation isn’t in the base technology—it’s in how efficiently you can customize it for your needs. That’s an infrastructure problem, not a model problem.

The distance that creates the view

JK’s ability to see these patterns comes from somewhere specific. He grew up in a classroom with one other student, not at a small private school but in a remote village in India, where a telegram about his grandmother’s death took 10 days to arrive. The world had phones. His village didn’t.

Years later, visiting relatives in the city, he saw a two-line pager clipped to his uncle’s belt. “I was like, whoa, what the hell is this?” he recalls. He asked which company made it. Motorola. “That’s where I want to be.”

That distance from infrastructure—then getting close to it—shapes how you think about abstraction layers. He joined Motorola during its decline, then landed on the early Android team at Google when they were 10-15 people figuring out what they were building. Then co-started Uber’s AI group. Then Anyscale.

JK has been in this position before: seeing the platform-level patterns emerge while individual companies think they’re solving unique problems. The moment that crystallized it happened on a bus in Panama in 2012. A local spent the entire ride on WhatsApp. JK asked what he was doing for so long. “He kind of gave me this look saying, dude, what a stupid question,” JK remembers. “He just said that this has allowed me to keep in touch with my family in remote village.”

From Android enabling that Panama bus connection to Ray enabling AI at scale—JK’s entire career has been about building the infrastructure layer that lets others build. And that vantage point is what lets him see the convergence happening now.

The honest take on AI coding agents

Anyscale is targeting 30% productivity gains from AI coding tools. Not 10x. Not zero. Thirty percent.

That’s the honest number—and it’s harder to achieve than you’d think.

JK tried an experiment a year ago: he fed Ray code to an AI agent and ran what it produced without reviewing it, just to see what would happen. His cluster crashed. He spent the next two hours debugging why. The agent had written code that consumed too much memory, causing out-of-memory errors.

This is someone who runs production AI infrastructure at massive scale. Even he can’t blindly trust AI-generated code.

What works instead: spec-driven development. Detailed markdown files. Clear design documents. Tell the agent exactly what you want—function length limits, testing requirements, how to use specific libraries. Then review what it produces.

“You cannot just vibe-code a system into production,” he explains. “I’ve seen engineers who say, ‘Oh, I just used agents for this,’ and they’ll look at the crap it has produced and it’s caused me more problems.”

But here’s where it gets interesting. One of Anyscale’s senior engineers was firmly in the anti-LLM camp. Then he worked on a complicated problem with a distinguished member of the staff who used agents effectively. Two days later, JK got a Slack message: “Couldn’t have produced this code in three days. It would’ve taken me two weeks.”

The pattern? Senior engineers who’ve been through previous platform shifts (internet, mobile, cloud) adapt faster. They recognize tectonic change when they see it. Junior engineers coming in fresh adapt quickly too—they haven’t developed rigid workflows yet. It’s the engineers in the middle who struggle most.

The critical part: humans stay in the loop. “We do not want to be completely dependent on agents that we lose the critical thinking part,” JK emphasizes. “You need to understand your code at a deep level. You need to understand your design at a deep level and then let agents do their thing.”

That’s the realistic target: 30% productivity gains with spec-driven development, human review, and clear expectations. Anyone claiming 10x either hasn’t shipped to production or is starting from scratch.

What this actually means

AI coding agents can deliver 30%—if you do it right. Write detailed specs. Review everything. Keep humans in the loop. The mechanical coding work gets faster. The thinking work stays yours. Anyone claiming more is overselling.

The heterogeneous compute insight matters right now. If you’re still thinking “just rent more GPUs,” you’re going to blow your budget. The companies that figure out CPU+GPU orchestration will have massive cost advantages. Your finance team already sees the waste—idle GPUs waiting for CPU tasks. Fix the pipeline efficiency, not the GPU count.

Post-training is where the moat lives. Pre-training gets the headlines. Post-training is where customization happens. Eight of ten popular libraries are built on the same infrastructure because orchestrating mixed inference and training stages is brutally complex. If you’re building serious AI products, understand your post-training infrastructure.

Multimodal is an infrastructure problem, not a model problem. Everyone’s focused on which model handles video best. The real bottleneck is data pipeline efficiency—moving between CPUs and GPUs without writing to disk at every stage. This is an engineering challenge, not a model selection challenge.

Your competitors are converging on the same stack. The differentiation isn’t in the infrastructure layer—it’s in what you build on top of it. Don’t waste time reinventing distributed compute. Use the patterns that already work at scale.

What patterns are you seeing in your AI deployments? Are you burning budget on underutilized GPUs? Hitting data pipeline bottlenecks trying to process multimodal data? Reply to this email—I’m curious which of these infrastructure-layer patterns match what you’re seeing.

About Anyscale:

Anyscale is the company behind Ray, the open-source distributed compute engine powering production AI at companies like OpenAI, Anthropic, Cursor, Apple, Spotify, Netflix, and Uber. Founded by the creators of Ray, Anyscale gives engineering teams a platform to scale ML and AI workloads—from data processing to training to inference—without manual infrastructure operations.

The platform helps companies deploy fault-tolerant clusters, optimize GPU utilization, and scale across CPUs, GPUs, and other accelerators. Whether you’re fine-tuning LLMs, running batch inference, or processing video at scale, Anyscale handles the distributed compute complexity so teams can focus on building AI products.

Learn more at anyscale.com.

High Output is brought to you by Maestro AI:

Maestro AI is an engineering visibility platform that helps leaders make data-driven decisions backed by narrative context. While most dashboards offer surface-level metrics, Maestro analyzes your team’s actual code, PRs, tickets, and communications to reveal not just what’s happening, but why.

The platform automatically synthesizes this activity into real-time feeds for every project, team, and individual—replacing subjective status meetings with objective truth. This allows you to identify blockers before they impact deadlines, de-risk key initiatives, and measure the true impact of tools like AI on your organization.

Visit https://getmaestro.ai to see how we help engineering leaders build more predictable and efficient organizations.

Leading distributed engineering teams? We’d love to hear your challenges. Schedule a chat with our team → https://getmaestro.ai/book


