MLOps.community

We Cut LLM Latency by 70% in Production



Maher Hanafi is an engineering leader who went from zero AI experience to self-hosting LLMs at enterprise scale — managing GPU costs, optimizing inference with TensorRT LLM, and building an AI platform for HR tech. In this conversation, he breaks down exactly how his team cut latency by 70%, reduced GPU spend through counterintuitive scaling strategies, and navigated the messy reality of taking AI from proof-of-concept to production.


How We Cut LLM Latency 70% With TensorRT in Production // MLOps Podcast #369 with Maher Hanafi, SVP of Engineering at Betterworks


Key topics covered:

The AI Iceberg — Why the invisible work behind AI (performance, latency, throughput, cost, accuracy) is harder than building the features themselves

GPU Cost Optimization — How upgrading to more expensive GPUs actually saved money by reducing total runtime hours

TensorRT LLM Deep Dive — Rewiring neural networks to match GPU architecture for 50-70% latency reduction

Cold Start Solutions — Using AWS FSx, baking models into container images, and cutting minutes off spin-up times

KV Cache & In-Flight Batching — Why using one model per GPU with maximum KV cache beats cramming multiple models together

Scheduled & Dynamic Scaling — Pattern-based scaling for HR tech workloads (nights, weekends, end-of-quarter spikes)

Verticalized AI Platform — Building horizontal AI infrastructure that serves multiple HR product verticals

AI Engineering Lab — How junior vs. senior engineers adopted AI coding tools differently, and the cultural shift that followed

Agentic Coding in Practice — Navigating AI coding agent costs, quality control, and redefining the SDLC

Chinese Models & Compliance — Why enterprise customers block DeepSeek/Qwen and the geopolitics of model training data
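As a rough illustration of the GPU cost point above: total spend is the hourly price times the hours the hardware runs, so a pricier GPU costs less overall whenever its speedup exceeds the price ratio. The sketch below is a minimal one under stated assumptions. The function names, hourly rates, and runtime hours are all hypothetical placeholders chosen for easy arithmetic; none of them come from the episode.

```python
def total_gpu_cost(hourly_rate_usd: float, runtime_hours: float) -> float:
    """Total spend: hourly price multiplied by the hours the GPU runs."""
    return hourly_rate_usd * runtime_hours


def breakeven_speedup(old_rate_usd: float, new_rate_usd: float) -> float:
    """Speedup the pricier GPU must exceed before it becomes cheaper overall."""
    return new_rate_usd / old_rate_usd


# Hypothetical numbers for illustration only (not figures from the episode).
baseline = total_gpu_cost(hourly_rate_usd=1.0, runtime_hours=100.0)
upgraded = total_gpu_cost(hourly_rate_usd=2.0, runtime_hours=40.0)

# The upgraded GPU costs twice as much per hour but finishes the same work
# in 40% of the time, so the total bill is lower.
savings = baseline - upgraded
```

With these placeholder values the upgrade only pays off because the assumed speedup (2.5x) is larger than the break-even speedup (2x); real numbers depend on the workload and the provider's pricing.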


This episode is for engineering leaders building AI in production, MLOps engineers optimizing GPU infrastructure, and anyone navigating the gap between AI demos and enterprise-scale deployment.


Links & Resources:

TensorRT LLM: https://github.com/NVIDIA/TensorRT-LLM

NVIDIA Run:ai Model Streamer (cold start optimization): https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/

vLLM vs TensorRT-LLM comparison: https://northflank.com/blog/vllm-vs-tensorrt-llm-and-how-to-run-them


Timestamps:

[00:00] Optimizing GPU Usage and Latency

[00:21] Learning AI as Leadership

[04:34] AI Cost Centers

[13:56] Throughput and Infrastructure Efficiency

[18:10] Scaling and Unit Economics

[24:14] Championing AI ROI

[36:11] Queue to Value Engine

[41:30] Failed Product Features

[46:12] Agentic Engineering Costs

[58:49] AI Self-Hosting in Engineering

[1:04:40] Wrap up

MLOps.community, by Demetrios


