AI Post Transformers

Speculative Decoding in Real vLLM Serving


This episode explores whether speculative decoding’s widely cited inference speedups survive real deployment conditions, using a January 2026 UC Berkeley paper that evaluates the method inside vLLM rather than in idealized toy benchmarks. It explains the core mechanics of draft-and-verify decoding, then digs into why acceptance length, verification cost, scheduler behavior, batching, KV-cache management, and long generations can erase much of the theoretical advantage in production serving stacks. The discussion also clarifies the difference between speculative decoding and multi-token prediction, situating approaches like MEDUSA and EAGLE within the broader effort to reduce autoregressive bottlenecks. Listeners interested in LLM systems will find it compelling because it shifts the conversation from flashy benchmark bar charts to the practical question of what actually improves wall-clock latency for real workloads.
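
To make the draft-and-verify mechanics concrete, below is a minimal sketch of one speculative step. This is not the paper's or vLLM's implementation: draft_logits_fn and target_logits_fn are hypothetical stand-ins for the draft and target models, the acceptance rule is the simple greedy variant rather than the probabilistic accept/reject test of speculative sampling, and a real engine would score all drafted positions in a single batched target forward pass instead of looping.

```python
import numpy as np

def speculative_decode_step(draft_logits_fn, target_logits_fn, prefix, k=4):
    """One draft-and-verify step (greedy variant, illustration only)."""
    # 1. Draft: the cheap model autoregressively proposes k candidate tokens.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = int(np.argmax(draft_logits_fn(ctx)))
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify: the target model checks each proposal; in production this
    #    is one batched forward pass over all k positions, not a loop.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        target_tok = int(np.argmax(target_logits_fn(ctx)))
        if target_tok != tok:
            accepted.append(target_tok)  # first mismatch: take the target's
            break                        # token and discard the rest
        accepted.append(tok)
        ctx.append(tok)
    else:
        # Every proposal matched, so the verify pass yields one bonus token.
        accepted.append(int(np.argmax(target_logits_fn(ctx))))
    return accepted  # 1 to k+1 tokens committed per target-model pass
```

The episode's cost argument reduces to simple arithmetic: each step commits between 1 and k+1 tokens but pays for k draft-model steps plus one verification pass. The toy calculation below (all numbers invented for illustration) shows how a headline speedup can evaporate once verification stops being nearly free, for example when large batches make the verify pass compute-bound:

```python
def spec_speedup(mean_accepted, k, draft_cost_ratio, verify_cost=1.0):
    """Idealized per-step speedup over plain autoregressive decoding.

    mean_accepted    tokens committed per step, in [1, k + 1]
    draft_cost_ratio one draft-model step relative to one target step
    verify_cost      verifying k tokens relative to decoding one token;
                     it grows past 1.0 once the pass is compute-bound
    """
    return mean_accepted / (k * draft_cost_ratio + verify_cost)

print(spec_speedup(3.0, k=4, draft_cost_ratio=0.1))                  # ~2.14x
print(spec_speedup(3.0, k=4, draft_cost_ratio=0.1, verify_cost=2.4)) # ~1.07x
```
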
Sources:
1. Speculative Decoding: Performance or Illusion? — Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung, 2026
http://arxiv.org/abs/2601.11580
2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
4. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023
https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
5. Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation — Heming Xia, Tao Ge, et al., 2023
https://scholar.google.com/scholar?q=Speculative+Decoding:+Exploiting+Speculative+Execution+for+Accelerating+Seq2seq+Generation
6. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023
https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
7. MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengyang Geng, et al., 2024
https://scholar.google.com/scholar?q=MEDUSA:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, et al., 2024
https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
9. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
10. EAGLE-3 — Yuhui Li, et al., 2025
https://scholar.google.com/scholar?q=EAGLE-3
11. Multi-Token Prediction — Liu et al.; Zeng et al., 2025
https://scholar.google.com/scholar?q=Multi-Token+Prediction
12. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding — Xia et al., 2024
https://scholar.google.com/scholar?q=Unlocking+Efficiency+in+Large+Language+Model+Inference:+A+Comprehensive+Survey+of+Speculative+Decoding
13. A Systematic Study of Speculative Decoding in Computation-Bound Regimes — Liu et al., 2024
https://scholar.google.com/scholar?q=A+Systematic+Study+of+Speculative+Decoding+in+Computation-Bound+Regimes
14. N-Gram Speculative Decoding — Saxena; Somasundaram et al., 2023–2024
https://scholar.google.com/scholar?q=N-Gram+Speculative+Decoding
15. Determinism and Nondeterminism in LLM Inference — He, 2025
https://scholar.google.com/scholar?q=Determinism+and+Nondeterminism+in+LLM+Inference
16. Block Verification Accelerates Speculative Decoding — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=Block+Verification+Accelerates+Speculative+Decoding
17. Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Zhang et al., 2024
https://scholar.google.com/scholar?q=Draft+&+Verify:+Lossless+Large+Language+Model+Acceleration+via+Self-Speculative+Decoding
18. MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=MagicDec:+Breaking+the+Latency-Throughput+Tradeoff+for+Long+Context+Generation+with+Speculative+Decoding
19. SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=SpecDec++:+Boosting+Speculative+Decoding+via+Adaptive+Candidate+Lengths
20. Adaptive Speculative Decoding for Large Language Models — authors not listed in the source, c. 2024
https://scholar.google.com/scholar?q=Adaptive+Speculative+Decoding+for+Large+Language+Models
21. Opt-Tree: Speculative Decoding with Adaptive Draft Tree Structure — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=Opt-Tree:+Speculative+Decoding+with+Adaptive+Draft+Tree+Structure
22. Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding — authors not listed in the source, c. 2024–2025
https://scholar.google.com/scholar?q=Draft+Model+Knows+When+to+Stop:+A+Self-Verification+Length+Policy+for+Speculative+Decoding
23. AI Post Transformers: Adaptive Control for Batched Speculative Decoding in LLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/adaptive-control-for-batched-speculative-decoding-in-llm-serving/
24. AI Post Transformers: Building Production-Ready Speculative Decoding with TensorRT-LLM — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/building-production-ready-speculative-decoding-with-tensorrt-llm/
25. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/apples-speculative-streaming-fast-llm-inference-without-auxiliary-models/
26. AI Post Transformers: Speculative Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3
27. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
28. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3