This episode explores Speculative Speculative Decoding, a technique that reduces LLM inference latency by overlapping drafting and verification more aggressively than standard speculative decoding. It explains how the method predicts likely verification outcomes in advance, so the draft model can prepare multiple next-step continuations while the target model is still checking the current block, with the target model retaining final authority over the output. The discussion covers the Saguaro algorithm, the distinction between true parallel generation and latency hiding around autoregressive dependencies, and the practical tradeoff between useful overlap and wasted draft-side computation. Listeners will appreciate its clear look at where modern inference systems still lose time, and how smarter scheduling, rather than changed model semantics, can unlock additional speed.
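The scheduling idea described above can be illustrated with a toy sketch: while the target model verifies the current block, the draft side speculates on a few likely verification outcomes and pre-drafts a continuation for each. This is a minimal illustration only, not the paper's Saguaro algorithm; `draft_block`, `verify_block`, and the two-outcome guessing heuristic are all stand-ins invented for this sketch.

```python
# Toy sketch of latency-hiding speculative decoding.
# All "model" calls are cheap stand-ins, not real LLMs.
from concurrent.futures import ThreadPoolExecutor
import time

def draft_block(prefix, k=4):
    # Stand-in for a cheap draft model proposing k tokens.
    return [f"d{len(prefix) + i}" for i in range(k)]

def verify_block(prefix, block):
    # Stand-in for the target model: accepts a prefix of the block.
    time.sleep(0.01)           # verification latency we want to hide
    return block[:len(block) - 1]  # pretend the last token is rejected

def generate(n_blocks=3):
    tokens = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        block = draft_block(tokens)
        for _ in range(n_blocks):
            # Kick off verification of the current block in the background...
            fut = pool.submit(verify_block, tokens, block)
            # ...and meanwhile pre-draft continuations for the most
            # likely outcomes (here: full accept, or all-but-one accepted,
            # a guessed heuristic for illustration).
            guesses = {
                len(block): draft_block(tokens + block),
                len(block) - 1: draft_block(tokens + block[:-1]),
            }
            accepted = fut.result()
            tokens += accepted
            # Reuse a pre-drafted block if we guessed the outcome right;
            # otherwise the speculation was wasted and we re-draft.
            block = guesses.get(len(accepted), draft_block(tokens))
    return tokens

print(generate())  # → ['d0', 'd1', ..., 'd8']
```

Because `verify_block` always rejects exactly one token here, the pre-drafted continuation for the "all-but-one" outcome is always reused; in a real system, mispredicted outcomes become the wasted draft-side computation the episode discusses.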
Sources:
1. Speculative Speculative Decoding — Tanishq Kumar, Tri Dao, Avner May, 2026
http://arxiv.org/abs/2603.03251
2. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023
https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
3. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024
https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty
4. SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification — Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia, 2024
https://scholar.google.com/scholar?q=SpecInfer:+Accelerating+Generative+LLM+Serving+with+Speculative+Inference+and+Token+Tree+Verification
5. PEARL: Parallel Speculative Decoding with Adaptive Draft Length — Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun, 2025
https://scholar.google.com/scholar?q=PEARL:+Parallel+Speculative+Decoding+with+Adaptive+Draft+Length
6. AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration — Bradley McDanel, 2025
https://scholar.google.com/scholar?q=AMUSD:+Asynchronous+Multi-Device+Speculative+Decoding+for+LLM+Acceleration
7. Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Jun Zhang et al., 2023
https://scholar.google.com/scholar?q=Draft+%26+Verify:+Lossless+Large+Language+Model+Acceleration+via+Self-Speculative+Decoding
8. SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration — Heming Xia et al., 2024
https://scholar.google.com/scholar?q=SWIFT:+On-the-Fly+Self-Speculative+Decoding+for+LLM+Inference+Acceleration
9. SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths — Kaixuan Huang, Xudong Guo, Mengdi Wang, 2024
https://scholar.google.com/scholar?q=SpecDec++:+Boosting+Speculative+Decoding+via+Adaptive+Candidate+Lengths
10. Accelerating Transformer Inference for Translation via Parallel Decoding — Andrea Santilli et al., 2023
https://scholar.google.com/scholar?q=Accelerating+Transformer+Inference+for+Translation+via+Parallel+Decoding
11. Self-Speculative Decoding for LLM-based ASR with CTC Encoder Drafts — George Saon et al., 2026
https://scholar.google.com/scholar?q=Self-Speculative+Decoding+for+LLM-based+ASR+with+CTC+Encoder+Drafts
12. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/apples-speculative-streaming-fast-llm-inference-without-auxiliary-models/
13. AI Post Transformers: Adaptive Control for Batched Speculative Decoding in LLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/adaptive-control-for-batched-speculative-decoding-in-llm-serving/
14. AI Post Transformers: FastGRPO: Concurrency-Aware Speculative Decoding for Policy Optimization — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fastgrpo-concurrency-aware-speculative-decoding-for-policy-optimization/