AI Post Transformers

Batch-Aware Expert Routing for Faster MoE Decoding



This episode explores a practical systems paper on speeding up Mixture-of-Experts (MoE) language models at inference time by changing how tokens are routed during decoding, with no retraining required. It explains why MoE models, despite their sparse per-token computation, can still be slow in real-world serving: small decode batches activate a large union of different experts, so inference becomes memory-bound as expert weights are loaded irregularly from memory. The discussion highlights the paper's central argument that routing should be batch-aware rather than token-local, so that each token's expert choices account for which experts are already being loaded for other tokens in the batch (see the sketch below). The episode is worth a listen for its clear explanation of the gap between MoE's theoretical efficiency and deployment reality, and for its focus on a low-cost serving optimization with direct economic impact on LLM inference.
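
The following is a minimal Python/NumPy sketch, not the paper's actual algorithm: it contrasts standard token-local top-k routing with a batch-aware variant that keeps each token's top-1 expert and biases the remaining choices toward experts other tokens in the batch have already activated. The function names, the additive score bonus, and its weight are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): batch-aware top-k expert
# routing for one MoE layer during decode. Token-local routing picks each
# token's top-k experts independently; the batch-aware variant keeps each
# token's top-1 choice but fills the remaining slots preferring experts that
# other tokens in the batch already activated, shrinking the union of expert
# weights that must be streamed from memory. The bonus term is an assumption.
import numpy as np

def token_local_topk(logits: np.ndarray, k: int) -> np.ndarray:
    """Standard routing: each token independently takes its k highest-scoring experts."""
    return np.argsort(-logits, axis=1)[:, :k]            # (batch, k) expert ids

def batch_aware_topk(logits: np.ndarray, k: int, bonus: float = 0.5) -> np.ndarray:
    """Opportunistic routing sketch: honor every token's top-1 expert, then fill
    the remaining k-1 slots with a bonus for experts already in the active set."""
    batch, n_experts = logits.shape
    choices = np.empty((batch, k), dtype=np.int64)
    choices[:, 0] = np.argmax(logits, axis=1)             # keep quality close to baseline
    active = np.zeros(n_experts, dtype=bool)
    active[choices[:, 0]] = True
    for t in range(batch):
        adjusted = logits[t] + bonus * active             # additive bonus is an assumption
        adjusted[choices[t, 0]] = -np.inf                 # don't pick the same expert twice
        picks = np.argsort(-adjusted)[: k - 1]
        choices[t, 1:] = picks
        active[picks] = True
    return choices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(8, 64))                     # 8 decode tokens, 64 experts
    local = token_local_topk(logits, k=4)
    aware = batch_aware_topk(logits, k=4)
    # Fewer distinct experts means fewer expert weight matrices loaded from HBM.
    print("experts loaded (token-local):", len(np.unique(local)))
    print("experts loaded (batch-aware):", len(np.unique(aware)))
```

The design intent illustrated here is that preserving each token's top-1 expert keeps output quality near the baseline, while steering the lower-ranked slots toward already-active experts reduces the number of distinct expert weights a memory-bound decode step has to load.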
Sources:
1. Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining — Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun, 2025
http://arxiv.org/abs/2511.02237
2. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer — Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017
https://scholar.google.com/scholar?q=Outrageously+Large+Neural+Networks:+The+Sparsely-Gated+Mixture-of-Experts+Layer
3. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — William Fedus, Barret Zoph, Noam Shazeer, 2021
https://scholar.google.com/scholar?q=Switch+Transformers:+Scaling+to+Trillion+Parameter+Models+with+Simple+and+Efficient+Sparsity
4. MegaBlocks: Efficient Sparse Training with Mixture-of-Experts — Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia, 2023
https://scholar.google.com/scholar?q=MegaBlocks:+Efficient+Sparse+Training+with+Mixture-of-Experts
5. Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining — Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun, 2025
https://scholar.google.com/scholar?q=Opportunistic+Expert+Activation:+Batch-Aware+Expert+Routing+for+Faster+Decode+Without+Retraining
6. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
7. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng and collaborators, 2024
https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs
8. DeepSeek-V3 Technical Report — DeepSeek-AI / Liu et al., 2024
https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report
9. Kimi K2 Technical Report — Kimi Team, 2025
https://scholar.google.com/scholar?q=Kimi+K2+Technical+Report
10. Qwen3 Technical Report — Yang et al., 2025
https://scholar.google.com/scholar?q=Qwen3+Technical+Report
11. The Roofline Model: A Pedagogical Tool for Program Analysis and Optimization — Samuel Williams, Andrew Waterman, David Patterson, 2009
https://scholar.google.com/scholar?q=The+Roofline+Model:+A+Pedagogical+Tool+for+Program+Analysis+and+Optimization
12. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
13. Moe-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache — approx. recent systems paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Moe-Infinity:+Efficient+MoE+Inference+on+Personal+Machines+with+Sparsity-Aware+Expert+Cache
14. Diff-MoE: Efficient Batched MoE Inference with Priority-Driven Differential Expert Caching — approx. recent systems paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Diff-MoE:+Efficient+Batched+MoE+Inference+with+Priority-Driven+Differential+Expert+Caching
15. SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference — approx. recent systems paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=SliceMoE:+Bit-Sliced+Expert+Caching+under+Miss-Rate+Constraints+for+Efficient+MoE+Inference
16. A Survey on Inference Optimization Techniques for Mixture of Experts Models — approx. recent survey, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=A+Survey+on+Inference+Optimization+Techniques+for+Mixture+of+Experts+Models
17. Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert Models — approx. recent paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Rewiring+Experts+on+the+Fly:+Continuous+Rerouting+for+Better+Online+Adaptation+in+Mixture-of-Expert+Models
18. Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers — approx. recent paper, exact authors unclear from snippet, 2024/2025
https://scholar.google.com/scholar?q=Stabilizing+MoE+Reinforcement+Learning+by+Aligning+Training+and+Inference+Routers
19. AI Post Transformers: Switch Transformers: Trillion Parameter Models with Sparsity — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/switch-transformers-trillion-parameter-models-with-sparsity/
20. AI Post Transformers: LFM2-8B-A1B: Efficient On-Device Mixture-of-Experts — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/lfm2-8b-a1b-efficient-on-device-mixture-of-experts/
21. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
22. AI Post Transformers: FlashAttention-2: Faster Attention with Better Parallelism — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flashattention-2-faster-attention-with-better-parallelism/
23. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
24. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
Interactive Visualization: Batch-Aware Expert Routing for Faster MoE Decoding

AI Post Transformers, by mcgrof