AI Post Transformers

Speculative Decoding in Real vLLM Serving


This episode explores whether speculative decoding’s widely cited inference speedups survive real deployment conditions, using a January 2026 UC Berkeley paper that evaluates the method inside vLLM rather than in idealized toy benchmarks. It explains the core mechanics of draft-and-verify decoding, then digs into why acceptance length, verification cost, scheduler behavior, batching, KV-cache management, and long generations can erase much of the theoretical advantage in production serving stacks. The discussion also clarifies the difference between speculative decoding and multi-token prediction, situating approaches like MEDUSA and EAGLE within the broader effort to reduce autoregressive bottlenecks. Listeners interested in LLM systems will find it compelling because it shifts the conversation from flashy benchmark bar charts to the practical question of what actually improves wall-clock latency for real workloads.
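
To make the draft-and-verify mechanics concrete, below is a minimal sketch of one speculative step. This is not the paper's or vLLM's implementation: draft_logits_fn and target_logits_fn are hypothetical stand-ins for the draft and target models, the acceptance rule is the simple greedy variant rather than the probabilistic accept/reject test of speculative sampling, and a real engine would score all drafted positions in a single batched target forward pass instead of looping.

```python
import numpy as np

def speculative_decode_step(draft_logits_fn, target_logits_fn, prefix, k=4):
    """One draft-and-verify step (greedy variant, illustration only)."""
    # 1. Draft: the cheap model autoregressively proposes k candidate tokens.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = int(np.argmax(draft_logits_fn(ctx)))
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify: the target model checks each proposal; in production this
    #    is one batched forward pass over all k positions, not a loop.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        target_tok = int(np.argmax(target_logits_fn(ctx)))
        if target_tok != tok:
            accepted.append(target_tok)  # first mismatch: take the target's
            break                        # token and discard the rest
        accepted.append(tok)
        ctx.append(tok)
    else:
        # Every proposal matched, so the verify pass yields one bonus token.
        accepted.append(int(np.argmax(target_logits_fn(ctx))))
    return accepted  # 1 to k+1 tokens committed per target-model pass
```

The episode's cost argument reduces to simple arithmetic: each step commits between 1 and k+1 tokens but pays for k draft-model steps plus one verification pass. The toy calculation below (all numbers invented for illustration) shows how a headline speedup can evaporate once verification stops being nearly free, for example when large batches make the verify pass compute-bound:

```python
def spec_speedup(mean_accepted, k, draft_cost_ratio, verify_cost=1.0):
    """Idealized per-step speedup over plain autoregressive decoding.

    mean_accepted    tokens committed per step, in [1, k + 1]
    draft_cost_ratio one draft-model step relative to one target step
    verify_cost      verifying k tokens relative to decoding one token;
                     it grows past 1.0 once the pass is compute-bound
    """
    return mean_accepted / (k * draft_cost_ratio + verify_cost)

print(spec_speedup(3.0, k=4, draft_cost_ratio=0.1))                  # ~2.14x
print(spec_speedup(3.0, k=4, draft_cost_ratio=0.1, verify_cost=2.4)) # ~1.07x
```
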
Sources:
1. Speculative Decoding: Performance or Illusion? — Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung, 2026
http://arxiv.org/abs/2601.11580
2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
4. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023
https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
5. Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation — Heming Xia, Tao Ge, et al., 2023
https://scholar.google.com/scholar?q=Speculative+Decoding:+Exploiting+Speculative+Execution+for+Accelerating+Seq2seq+Generation
6. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023
https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
7. MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengyang Geng, et al., 2024
https://scholar.google.com/scholar?q=MEDUSA:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, et al., 2024
https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
9. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
10. EAGLE-3 — Yuhui Li, et al., 2025
https://scholar.google.com/scholar?q=EAGLE-3
11. Multi-Token Prediction — Liu et al.; Zeng et al., 2025
https://scholar.google.com/scholar?q=Multi-Token+Prediction
12. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding — Xia et al., 2024
https://scholar.google.com/scholar?q=Unlocking+Efficiency+in+Large+Language+Model+Inference:+A+Comprehensive+Survey+of+Speculative+Decoding
13. A Systematic Study of Speculative Decoding in Computation-Bound Regimes — Liu et al., 2024
https://scholar.google.com/scholar?q=A+Systematic+Study+of+Speculative+Decoding+in+Computation-Bound+Regimes
14. N-Gram Speculative Decoding — Saxena; Somasundaram et al., 2023–2024
https://scholar.google.com/scholar?q=N-Gram+Speculative+Decoding
15. Determinism and Nondeterminism in LLM Inference — He, 2025
https://scholar.google.com/scholar?q=Determinism+and+Nondeterminism+in+LLM+Inference
16. Block Verification Accelerates Speculative Decoding — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=Block+Verification+Accelerates+Speculative+Decoding
17. Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Zhang et al., 2024
https://scholar.google.com/scholar?q=Draft+&+Verify:+Lossless+Large+Language+Model+Acceleration+via+Self-Speculative+Decoding
18. MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=MagicDec:+Breaking+the+Latency-Throughput+Tradeoff+for+Long+Context+Generation+with+Speculative+Decoding
19. SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=SpecDec++:+Boosting+Speculative+Decoding+via+Adaptive+Candidate+Lengths
20. Adaptive Speculative Decoding for Large Language Models — authors not listed in the source, c. 2024
https://scholar.google.com/scholar?q=Adaptive+Speculative+Decoding+for+Large+Language+Models
21. Opt-Tree: Speculative Decoding with Adaptive Draft Tree Structure — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=Opt-Tree:+Speculative+Decoding+with+Adaptive+Draft+Tree+Structure
22. Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=Draft+Model+Knows+When+to+Stop:+A+Self-Verification+Length+Policy+for+Speculative+Decoding
23. AI Post Transformers: Adaptive Control for Batched Speculative Decoding in LLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/adaptive-control-for-batched-speculative-decoding-in-llm-serving/
24. AI Post Transformers: Building Production-Ready Speculative Decoding with TensorRT-LLM — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/building-production-ready-speculative-decoding-with-tensorrt-llm/
25. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/apples-speculative-streaming-fast-llm-inference-without-auxiliary-models/
26. AI Post Transformers: Speculative Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3
27. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
28. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3