This episode explores Speculative Speculative Decoding, a technique that reduces LLM inference latency by overlapping drafting and verification more aggressively than standard speculative decoding. It explains how the method predicts likely verification outcomes in advance, so the draft model can prepare multiple next-step continuations while the target model is still checking the current block, with the target model retaining final authority over the output. The discussion covers the Saguaro algorithm, the distinction between true parallel generation and latency hiding around autoregressive dependencies, and the practical tradeoff between useful overlap and wasted draft-side computation. Listeners will appreciate its clear look at where modern inference systems still lose time, and how smarter scheduling, rather than changed model semantics, can unlock additional speed.
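The scheduling idea described above can be illustrated with a toy sketch: while the target model verifies the current block, the draft side speculates on a few likely verification outcomes and pre-drafts a continuation for each. This is a minimal illustration only, not the paper's Saguaro algorithm; `draft_block`, `verify_block`, and the two-outcome guessing heuristic are all stand-ins invented for this sketch.

```python
# Toy sketch of latency-hiding speculative decoding.
# All "model" calls are cheap stand-ins, not real LLMs.
from concurrent.futures import ThreadPoolExecutor
import time

def draft_block(prefix, k=4):
    # Stand-in for a cheap draft model proposing k tokens.
    return [f"d{len(prefix) + i}" for i in range(k)]

def verify_block(prefix, block):
    # Stand-in for the target model: accepts a prefix of the block.
    time.sleep(0.01)           # verification latency we want to hide
    return block[:len(block) - 1]  # pretend the last token is rejected

def generate(n_blocks=3):
    tokens = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        block = draft_block(tokens)
        for _ in range(n_blocks):
            # Kick off verification of the current block in the background...
            fut = pool.submit(verify_block, tokens, block)
            # ...and meanwhile pre-draft continuations for the most
            # likely outcomes (here: full accept, or all-but-one accepted,
            # a guessed heuristic for illustration).
            guesses = {
                len(block): draft_block(tokens + block),
                len(block) - 1: draft_block(tokens + block[:-1]),
            }
            accepted = fut.result()
            tokens += accepted
            # Reuse a pre-drafted block if we guessed the outcome right;
            # otherwise the speculation was wasted and we re-draft.
            block = guesses.get(len(accepted), draft_block(tokens))
    return tokens

print(generate())  # → ['d0', 'd1', ..., 'd8']
```

Because `verify_block` always rejects exactly one token here, the pre-drafted continuation for the "all-but-one" outcome is always reused; in a real system, mispredicted outcomes become the wasted draft-side computation the episode discusses.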
Sources:
1. Speculative Speculative Decoding — Tanishq Kumar, Tri Dao, Avner May, 2026
http://arxiv.org/abs/2603.03251
2. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023
https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
3. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
4. SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification — Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia, 2024
https://scholar.google.com/scholar?q=SpecInfer:+Accelerating+Generative+LLM+Serving+with+Speculative+Inference+and+Token+Tree+Verification
5. PEARL: Parallel Speculative Decoding with Adaptive Draft Length — Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun, 2025
https://scholar.google.com/scholar?q=PEARL:+Parallel+Speculative+Decoding+with+Adaptive+Draft+Length
6. AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration — Bradley McDanel, 2025
https://scholar.google.com/scholar?q=AMUSD:+Asynchronous+Multi-Device+Speculative+Decoding+for+LLM+Acceleration
7. Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Jun Zhang et al., 2023
https://scholar.google.com/scholar?q=Draft+%26+Verify:+Lossless+Large+Language+Model+Acceleration+via+Self-Speculative+Decoding
8. SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration — Heming Xia et al., 2024
https://scholar.google.com/scholar?q=SWIFT:+On-the-Fly+Self-Speculative+Decoding+for+LLM+Inference+Acceleration
9. SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths — Kaixuan Huang, Xudong Guo, Mengdi Wang, 2024
https://scholar.google.com/scholar?q=SpecDec++:+Boosting+Speculative+Decoding+via+Adaptive+Candidate+Lengths
10. Accelerating Transformer Inference for Translation via Parallel Decoding — Andrea Santilli et al., 2023
https://scholar.google.com/scholar?q=Accelerating+Transformer+Inference+for+Translation+via+Parallel+Decoding
11. Self-Speculative Decoding for LLM-based ASR with CTC Encoder Drafts — George Saon et al., 2026
https://scholar.google.com/scholar?q=Self-Speculative+Decoding+for+LLM-based+ASR+with+CTC+Encoder+Drafts
12. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/apples-speculative-streaming-fast-llm-inference-without-auxiliary-models/
13. AI Post Transformers: Adaptive Control for Batched Speculative Decoding in LLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/adaptive-control-for-batched-speculative-decoding-in-llm-serving/
14. AI Post Transformers: FastGRPO: Concurrency-Aware Speculative Decoding for Policy Optimization — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fastgrpo-concurrency-aware-speculative-decoding-for-policy-optimization/