Woosuk Kwon is CTO of Inferact and creator of the vLLM inference library. In this episode, Woosuk shares what it takes to build the most popular open-source LLM inference engine, from a human-centered perspective.
Outline:
0:00 - Prelude: Introducing Woosuk and Inferact
3:00 - Woosuk’s First PhD Project
6:00 - How the vLLM Project Got Started
9:18 - AI Infra Needs More Than Just Efficiency
14:08 - How AI Infra and Human-centered AI Are Connected
15:01 - How to Prioritize Feature Requests for Popular AI Infra
18:18 - Streaming Requests and Realtime API
24:05 - Multi-turn, Agentic, Proactive LLMs
27:03 - How to Design AI Infra in a Principled Way
29:13 - How to Design an AI Inference Engine for Continual Learning with RL
35:05 - Would LoRA Training Affect RL Infra Design?
37:28 - Why Start an AI Inference Infra Startup?
40:46 - What Effortless Inference with Open-source Models Means for Developers
43:46 - A Vision for On-device AI Inference
46:19 - Can Today’s Coding Agents Create vLLM?
References:
Inferact: https://inferact.ai/
Efficient Memory Management for Large Language Model Serving with PagedAttention: https://arxiv.org/abs/2309.06180
Streaming Requests & Realtime API in vLLM: https://vllm.ai/blog/streaming-realtime
RL’s Razor: Why Online Reinforcement Learning Forgets Less: https://arxiv.org/abs/2509.04259
Podcast Links:
Podcast website: https://augmented-mind.github.io/
Apple Podcasts: https://podcasts.apple.com/us/podcast/augmented-mind-podcast/id1868102170
Spotify: https://open.spotify.com/show/40KculkYTe2tOpqJm6TAYr?si=PU_UncsMT4mXjVNCRwoXog&nd=1&dlsi=6d9bed7a43d64085
RSS: https://anchor.fm/s/10dbf5b7c/podcast/rss
About the Hosts:
The AM Podcast is hosted by Yijia Shao, Shannon Shen, and Michael Ryan, CS PhD students at Stanford University and MIT.