AI Post Transformers

Adaptive Control for Batched Speculative Decoding in LLM Serving



We review two papers that examine the integration of speculative decoding and request batching to accelerate Large Language Model (LLM) inference. While both techniques aim to improve GPU utilization, the research identifies a critical tension: at large batch sizes, speculation can actually lose effectiveness, because batching already saturates the GPU and the extra verification work no longer comes for free. To resolve this, the authors propose adaptive strategies that dynamically adjust the number of speculated tokens based on the real-time batch size and token acceptance rate. Systems like TurboSpec combine offline profiling with online predictors to estimate goodput, enabling speculation only when it provides a genuine speedup. Experimental results show that these automated control mechanisms significantly reduce latency and avoid wasted computation across varying traffic patterns. Ultimately, this adaptive approach lets serving systems maintain near-optimal performance regardless of hardware architecture or fluctuating user demand.

Sources:
1) "The Synergy of Speculative Decoding and Batching in Serving Large Language Models" (2023). Qidong Su, Christina Giannoula, Gennady Pekhimenko. University of Toronto, CentML Inc, Vector Institute. https://arxiv.org/pdf/2310.18813
2) "TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput" (2024). Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang. UC Berkeley, UCSD, Tsinghua University, University of Chicago, SJTU. https://arxiv.org/pdf/2406.14066
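As a rough illustration of the control loop these papers describe, the Python sketch below picks a speculation length k (0 means no speculation) that maximizes estimated goodput for the current batch size and measured acceptance rate. Everything here is an illustrative assumption rather than TurboSpec's actual implementation: the linear step-time model stands in for offline profiling, token acceptance is modeled as i.i.d. with probability alpha, and all names (ProfiledCosts, choose_speculation_length, etc.) are hypothetical.

```python
"""Minimal sketch of a goodput-driven speculation-length controller.

Assumptions (not from the papers): a linear step-time model standing in
for offline profiling, and i.i.d. per-token acceptance with prob. alpha.
"""

from dataclasses import dataclass


@dataclass
class ProfiledCosts:
    # Hypothetical offline-profiled linear model:
    # step_time_ms = base_ms + slope_ms_per_token * (tokens verified per step)
    base_ms: float = 4.0
    slope_ms_per_token: float = 0.05


def expected_accepted_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model step when the draft
    proposes k tokens, each accepted independently with probability alpha.
    Standard speculative-decoding expectation: 1 + alpha + ... + alpha^k."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_time_ms(costs: ProfiledCosts, batch_size: int, k: int) -> float:
    """Verification step time when every request in the batch verifies
    k draft tokens plus one bonus token (k + 1 positions each)."""
    return costs.base_ms + costs.slope_ms_per_token * batch_size * (k + 1)


def choose_speculation_length(
    costs: ProfiledCosts, batch_size: int, alpha: float, max_k: int = 8
) -> int:
    """Pick the k maximizing estimated goodput: expected accepted tokens
    across the batch divided by the step time that produces them."""
    best_k, best_goodput = 0, 0.0
    for k in range(max_k + 1):
        goodput = (
            batch_size
            * expected_accepted_tokens(alpha, k)
            / step_time_ms(costs, batch_size, k)
        )
        if goodput > best_goodput:
            best_k, best_goodput = k, goodput
    return best_k


if __name__ == "__main__":
    costs = ProfiledCosts()
    for batch_size in (1, 8, 64, 256):
        k = choose_speculation_length(costs, batch_size, alpha=0.7)
        print(f"batch={batch_size:4d} -> speculate {k} tokens")
```

With these example numbers the chosen k shrinks as the batch grows (k=8 at batch 1 down to k=0 at batch 256), mirroring the papers' observation that speculation pays off less once batching alone saturates the GPU.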

By mcgrof