April 10, 2026

We Cut LLM Latency by 70% in Production

1 hour 5 minutes

Maher Hanafi is an engineering leader who went from zero AI experience to self-hosting LLMs at enterprise scale: managing GPU costs, optimizing inference with TensorRT LLM, and building an AI platform for HR tech. In this conversation, he breaks down how his team cut latency by 70%, reduced GPU spend through counterintuitive scaling strategies, and navigated the messy reality of taking AI from proof-of-concept to production.

How We Cut LLM Latency 70% With TensorRT in Production // MLOps Podcast #369 with Maher Hanafi, SVP of Engineering at Betterworks

Key topics covered:

The AI Iceberg — Why the invisible work behind AI (performance, latency, throughput, cost, accuracy) is harder than building the features themselves

GPU Cost Optimization — How upgrading to more expensive GPUs actually saved money by reducing total runtime hours

TensorRT LLM Deep Dive — Rewiring neural networks to match GPU architecture for 50-70% latency reduction

Cold Start Solutions — Using AWS FSx, baking models into container images, and cutting minutes off spin-up times

KV Cache & In-Flight Batching — Why using one model per GPU with maximum KV cache beats cramming multiple models together

Scheduled & Dynamic Scaling — Pattern-based scaling for HR tech workloads (nights, weekends, end-of-quarter spikes)

Verticalized AI Platform — Building horizontal AI infrastructure that serves multiple HR product verticals

AI Engineering Lab — How junior vs. senior engineers adopted AI coding tools differently, and the cultural shift that followed

Agentic Coding in Practice — Navigating AI coding agent costs, quality control, and redefining the SDLC

Chinese Models & Compliance — Why enterprise customers block DeepSeek/Qwen and the geopolitics of model training data
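
The GPU cost point above comes down to simple arithmetic: total spend is the hourly rate times the hours the instance runs, so a pricier GPU wins whenever its speedup exceeds its price ratio. The sketch below illustrates that relationship only; the rates and runtimes are hypothetical placeholders, not figures from the episode, and the function names are made up for this example.

```python
def total_cost(hourly_rate: float, hours: float) -> float:
    """Total GPU spend: price per hour times hours the instance runs."""
    return hourly_rate * hours


def pricier_gpu_is_cheaper(cheap_rate: float, cheap_hours: float,
                           fast_rate: float, fast_hours: float) -> bool:
    """True when the faster GPU's speedup exceeds its price ratio."""
    speedup = cheap_hours / fast_hours
    price_ratio = fast_rate / cheap_rate
    return speedup > price_ratio


# Hypothetical numbers chosen only to show the shape of the trade-off.
cheap = total_cost(hourly_rate=1.00, hours=10.0)  # slower GPU, runs longer
fast = total_cost(hourly_rate=2.00, hours=4.0)    # pricier GPU, finishes sooner

print(f"cheaper GPU total: {cheap:.2f}")
print(f"pricier GPU total: {fast:.2f}")
print("pricier GPU saves money:",
      pricier_gpu_is_cheaper(1.00, 10.0, 2.00, 4.0))
```

With these placeholder values the GPU that costs twice as much per hour finishes 2.5 times faster, so its total bill is lower. Real savings depend on measured runtimes for the actual workload.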

This episode is for engineering leaders building AI in production, MLOps engineers optimizing GPU infrastructure, and anyone navigating the gap between AI demos and enterprise-scale deployment.

Links & Resources:

TensorRT LLM: https://github.com/NVIDIA/TensorRT-LLM

NVIDIA Run:ai Model Streamer (cold start optimization): https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/

vLLM vs TensorRT-LLM comparison: https://northflank.com/blog/vllm-vs-tensorrt-llm-and-how-to-run-them

Timestamps:

[00:00] Optimizing GPU Usage and Latency

[00:21] Learning AI as Leadership

[04:34] AI Cost Centers

[13:56] Throughput and Infrastructure Efficiency

[18:10] Scaling and Unit Economics

[24:14] Championing AI ROI

[36:11] Queue to Value Engine

[41:30] Failed Product Features

[46:12] Agentic Engineering Costs

[58:49] AI Self-Hosting in Engineering

[1:04:40] Wrap up

MLOps.community

By Demetrios
