AI Post Transformers

Batch-Aware Expert Routing for Faster MoE Decoding



This episode explores a practical systems paper on speeding up Mixture-of-Experts (MoE) language models at inference time by changing how tokens are routed during decoding, with no retraining required. It explains why MoE models, despite their sparse per-token computation, can still be slow in real-world serving: small decode batches activate a large union of different experts, so inference becomes memory-bound as expert weights are loaded irregularly from memory. The discussion highlights the paper's central argument that routing should be batch-aware rather than token-local, so that each token's expert choices account for which experts are already being loaded for other tokens in the batch (see the sketch below). The episode is worth a listen for its clear explanation of the gap between MoE's theoretical efficiency and deployment reality, and for its focus on a low-cost serving optimization with direct economic impact on LLM inference.
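
The following is a minimal Python/NumPy sketch, not the paper's actual algorithm: it contrasts standard token-local top-k routing with a batch-aware variant that keeps each token's top-1 expert and biases the remaining choices toward experts other tokens in the batch have already activated. The function names, the additive score bonus, and its weight are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): batch-aware top-k expert
# routing for one MoE layer during decode. Token-local routing picks each
# token's top-k experts independently; the batch-aware variant keeps each
# token's top-1 choice but fills the remaining slots preferring experts that
# other tokens in the batch already activated, shrinking the union of expert
# weights that must be streamed from memory. The bonus term is an assumption.
import numpy as np

def token_local_topk(logits: np.ndarray, k: int) -> np.ndarray:
    """Standard routing: each token independently takes its k highest-scoring experts."""
    return np.argsort(-logits, axis=1)[:, :k]            # (batch, k) expert ids

def batch_aware_topk(logits: np.ndarray, k: int, bonus: float = 0.5) -> np.ndarray:
    """Opportunistic routing sketch: honor every token's top-1 expert, then fill
    the remaining k-1 slots with a bonus for experts already in the active set."""
    batch, n_experts = logits.shape
    choices = np.empty((batch, k), dtype=np.int64)
    choices[:, 0] = np.argmax(logits, axis=1)             # keep quality close to baseline
    active = np.zeros(n_experts, dtype=bool)
    active[choices[:, 0]] = True
    for t in range(batch):
        adjusted = logits[t] + bonus * active             # additive bonus is an assumption
        adjusted[choices[t, 0]] = -np.inf                 # don't pick the same expert twice
        picks = np.argsort(-adjusted)[: k - 1]
        choices[t, 1:] = picks
        active[picks] = True
    return choices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(8, 64))                     # 8 decode tokens, 64 experts
    local = token_local_topk(logits, k=4)
    aware = batch_aware_topk(logits, k=4)
    # Fewer distinct experts means fewer expert weight matrices loaded from HBM.
    print("experts loaded (token-local):", len(np.unique(local)))
    print("experts loaded (batch-aware):", len(np.unique(aware)))
```

The design intent illustrated here is that preserving each token's top-1 expert keeps output quality near the baseline, while steering the lower-ranked slots toward already-active experts reduces the number of distinct expert weights a memory-bound decode step has to load.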
Sources:
1. Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining — Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun, 2025
http://arxiv.org/abs/2511.02237
2. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer — Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017
https://scholar.google.com/scholar?q=Outrageously+Large+Neural+Networks:+The+Sparsely-Gated+Mixture-of-Experts+Layer
3. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — William Fedus, Barret Zoph, Noam Shazeer, 2021
https://scholar.google.com/scholar?q=Switch+Transformers:+Scaling+to+Trillion+Parameter+Models+with+Simple+and+Efficient+Sparsity
4. MegaBlocks: Efficient Sparse Training with Mixture-of-Experts — Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia, 2023
https://scholar.google.com/scholar?q=MegaBlocks:+Efficient+Sparse+Training+with+Mixture-of-Experts
5. Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining — Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun, 2025
https://scholar.google.com/scholar?q=Opportunistic+Expert+Activation:+Batch-Aware+Expert+Routing+for+Faster+Decode+Without+Retraining
6. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
7. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng and collaborators, 2024
https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs
8. DeepSeek-V3 Technical Report — DeepSeek-AI / Liu et al., 2024
https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report
9. Kimi K2 Technical Report — Kimi Team, 2025
https://scholar.google.com/scholar?q=Kimi+K2+Technical+Report
10. Qwen3 Technical Report — Yang et al., 2025
https://scholar.google.com/scholar?q=Qwen3+Technical+Report
11. The Roofline Model: A Pedagogical Tool for Program Analysis and Optimization — Samuel Williams, Andrew Waterman, David Patterson, 2009
https://scholar.google.com/scholar?q=The+Roofline+Model:+A+Pedagogical+Tool+for+Program+Analysis+and+Optimization
12. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
13. Moe-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache — approx. recent systems paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Moe-Infinity:+Efficient+MoE+Inference+on+Personal+Machines+with+Sparsity-Aware+Expert+Cache
14. Diff-MoE: Efficient Batched MoE Inference with Priority-Driven Differential Expert Caching — approx. recent systems paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Diff-MoE:+Efficient+Batched+MoE+Inference+with+Priority-Driven+Differential+Expert+Caching
15. SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference — approx. recent systems paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=SliceMoE:+Bit-Sliced+Expert+Caching+under+Miss-Rate+Constraints+for+Efficient+MoE+Inference
16. A Survey on Inference Optimization Techniques for Mixture of Experts Models — approx. recent survey, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=A+Survey+on+Inference+Optimization+Techniques+for+Mixture+of+Experts+Models
17. Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert Models — approx. recent paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Rewiring+Experts+on+the+Fly:+Continuous+Rerouting+for+Better+Online+Adaptation+in+Mixture-of-Expert+Models
18. Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers — approx. recent paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Stabilizing+MoE+Reinforcement+Learning+by+Aligning+Training+and+Inference+Routers
19. AI Post Transformers: Switch Transformers: Trillion Parameter Models with Sparsity — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/switch-transformers-trillion-parameter-models-with-sparsity/
20. AI Post Transformers: LFM2-8B-A1B: Efficient On-Device Mixture-of-Experts — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/lfm2-8b-a1b-efficient-on-device-mixture-of-experts/
21. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
22. AI Post Transformers: FlashAttention-2: Faster Attention with Better Parallelism — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flashattention-2-faster-attention-with-better-parallelism/
23. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
24. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
Interactive Visualization: Batch-Aware Expert Routing for Faster MoE Decoding

AI Post Transformers, by mcgrof