


By Victor Leung

Deploying Large Language Models (LLMs) for inference is a complex yet rewarding process that requires balancing performance, cost, and scalability. Optimizing and sizing LLM inference systems involves understanding tradeoffs, selecting the right tools, and leveraging NVIDIA's advanced technologies like TensorRT-LLM, Triton Inference Server, and NVIDIA Inference Microservices (NIM). This guide explores the key techniques and strategies for efficient LLM deployment.
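To make the deployment target concrete, here is a minimal sketch of how a client might query a locally hosted NIM (or any OpenAI-compatible) endpoint. The base URL, port, and model identifier below are illustrative assumptions, not part of the original guide; adjust them to match your own deployment.

```python
# Minimal sketch: sending a chat request to a locally deployed,
# OpenAI-compatible LLM endpoint (e.g., a NIM container).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint and port
    api_key="not-used",                   # local deployments typically ignore the key
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",      # hypothetical model identifier
    messages=[{"role": "user", "content": "Summarize the benefits of TensorRT-LLM."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Keeping the client side on a standard OpenAI-compatible interface is convenient because the serving stack underneath (Triton Inference Server, TensorRT-LLM engines, or a NIM container) can be swapped or resized without changing application code.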
