
Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.
John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.
You will learn:
How to deploy vLLM on Kubernetes to serve open-source LLMs like Mistral and Llama, including configuration challenges with GPU drivers and DaemonSets (a small serving sketch follows this list)
Why smaller models (7-14B parameters), given proper prompt engineering, can reach roughly 95% of the effectiveness of larger commercial models for many tasks
How running inference workloads on your own infrastructure with T4 GPUs can cut monthly costs from tens of thousands of dollars to a couple of thousand (a rough cost sketch also follows the list)
Practical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issues
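As a taste of the vLLM setup discussed in the episode, here is a minimal sketch of querying an in-cluster vLLM server. vLLM exposes an OpenAI-compatible HTTP API, so the standard openai Python client works; the Service DNS name and the Mistral model ID below are illustrative assumptions, not values taken from the episode.

```python
# Minimal sketch: querying a vLLM server running in-cluster via its
# OpenAI-compatible API (vLLM listens on port 8000 by default).
# The Service name and model ID are illustrative assumptions only.
from openai import OpenAI

client = OpenAI(
    base_url="http://ai-inference.default.svc.cluster.local:8000/v1",  # hypothetical Service
    api_key="not-needed",  # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # an example 7B open-source model
    messages=[
        {"role": "user", "content": "Summarize this contributor's recent GitHub activity."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```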
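And for a rough sense of the cost math, a back-of-the-envelope sketch is below. The hourly T4 rate and GPU count are assumed ballpark figures, not numbers quoted in the episode, and they cover only the GPUs, not the rest of the node.

```python
# Back-of-the-envelope GPU cost sketch. The hourly rate is an assumed
# ballpark for an on-demand NVIDIA T4 on a major cloud, not a figure
# from the episode; plug in your provider's actual pricing.
T4_HOURLY_USD = 0.35          # assumption: approximate on-demand T4 price
HOURS_PER_MONTH = 24 * 30     # always-on inference node
NUM_GPUS = 2                  # assumption: two replicas of the vLLM server

monthly_gpu_cost = T4_HOURLY_USD * HOURS_PER_MONTH * NUM_GPUS
print(f"~${monthly_gpu_cost:,.0f}/month for {NUM_GPUS} T4s (GPU cost only)")
```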
Sponsor
This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input: existing cloud environments, IaC, or application code.
More info
Find all the links and info for this episode here: https://ku.bz/wP6bTlrFs
Interested in sponsoring an episode? Learn more.
