Woosuk Kwon is CTO of Inferact and creator of the vLLM inference library. Woosuk shares what it takes to build the most popular open-source LLM inference engine from a human-centered perspective.
Outline:
0:00 - Prelude: Introducing Woosuk and Inferact
3:00 - Woosuk’s First PhD Project
6:00 - How the vLLM Project Got Started
9:18 - AI Infra Needs More Than Just Efficiency
14:08 - How AI Infra and Human-centered AI Are Connected
15:01 - How to Prioritize Feature Requests for Popular AI Infra
18:18 - Streaming Requests and Realtime API
24:05 - Multi-turn, Agentic, Proactive LLMs
27:03 - How to Design AI Infra in a Principled Way
29:13 - How to Design an AI Inference Engine for Continual Learning with RL
35:05 - Would LoRA Training Affect RL Infra Design?
37:28 - Why Start an AI Inference Infra Startup?
40:46 - What Effortless Inference with Open-source Models Means for Developers
43:46 - A Vision for On-device AI Inference
46:19 - Can Today’s Coding Agents Create vLLM?
References:
Inferact: https://inferact.ai/
Efficient Memory Management for Large Language Model Serving with PagedAttention: https://arxiv.org/abs/2309.06180
Streaming Requests & Realtime API in vLLM: https://vllm.ai/blog/streaming-realtime
RL’s Razor: Why Online Reinforcement Learning Forgets Less: https://arxiv.org/abs/2509.04259
Podcast Links:
Podcast website: https://augmented-mind.github.io/
Apple Podcasts: https://podcasts.apple.com/us/podcast/augmented-mind-podcast/id1868102170
Spotify: https://open.spotify.com/show/40KculkYTe2tOpqJm6TAYr?si=PU_UncsMT4mXjVNCRwoXog&nd=1&dlsi=6d9bed7a43d64085
RSS: https://anchor.fm/s/10dbf5b7c/podcast/rss
About the Hosts:
The AM Podcast is hosted by Yijia Shao, Shannon Shen, and Michael Ryan, CS PhD students at Stanford University and MIT.