AI: post transformers

fMoE: Fine-Grained Expert Offloading for MoE Serving



This February 2025 paper introduces fMoE, a fine-grained expert offloading system designed to improve the serving efficiency of Mixture-of-Experts (MoE) Large Language Models (LLMs). The paper highlights the memory inefficiency of MoE-based LLMs during inference, where inactive experts still occupy GPU memory, and the limitations of existing coarse-grained offloading solutions, which struggle with the latency-memory trade-off. fMoE addresses these challenges by tracking iteration-level expert probability distributions through "expert maps" and by leveraging input semantic embeddings to guide expert prefetching, caching, and offloading decisions. Experiments show that fMoE significantly reduces inference latency and improves expert hit rates compared to state-of-the-art offloading methods.
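To make the core idea concrete, here is a minimal, illustrative Python sketch of expert-map-guided prefetching under a fixed GPU budget. It is not the paper's implementation: the class name ExpertMapCache, its methods, and the cosine-similarity lookup over past input embeddings are assumptions used only to show how an iteration-level expert map could steer which experts are loaded into, or evicted from, GPU memory.

```python
import numpy as np


class ExpertMapCache:
    """Hypothetical sketch: a per-layer "expert map" records how likely each
    expert is to activate for inputs with a given semantic embedding. At
    inference time, the stored map most similar to the current input's
    embedding guides which experts to prefetch into limited GPU memory."""

    def __init__(self, num_layers, num_experts, gpu_slots_per_layer):
        self.num_layers = num_layers
        self.num_experts = num_experts
        self.gpu_slots = gpu_slots_per_layer
        # Observed (input embedding, per-layer expert-probability matrix) pairs.
        self.history = []
        # Experts currently resident in GPU memory, per layer.
        self.resident = [set() for _ in range(num_layers)]

    def record(self, embedding, expert_probs):
        """Store an observed iteration-level expert probability map."""
        self.history.append((np.asarray(embedding), np.asarray(expert_probs)))

    def _closest_map(self, embedding):
        """Return the stored expert map whose input embedding has the highest
        cosine similarity to the current input's embedding."""
        e = np.asarray(embedding)
        best, best_sim = None, -np.inf
        for emb, probs in self.history:
            sim = float(e @ emb) / (np.linalg.norm(e) * np.linalg.norm(emb) + 1e-8)
            if sim > best_sim:
                best, best_sim = probs, sim
        return best

    def plan_prefetch(self, embedding):
        """Per layer, select the experts most likely to fire, prefetch any
        that are not already resident, and implicitly evict the rest."""
        probs = self._closest_map(embedding)
        if probs is None:  # No history yet: keep the current resident set.
            return [set() for _ in range(self.num_layers)]
        to_load = []
        for layer in range(self.num_layers):
            wanted = set(np.argsort(probs[layer])[::-1][: self.gpu_slots])
            to_load.append(wanted - self.resident[layer])
            self.resident[layer] = wanted
        return to_load
```

In this sketch, `record` would be called after each decoding iteration with the router's observed expert probabilities, and `plan_prefetch` before the next iteration to decide which expert weights to copy to the GPU; the real system's tracking granularity, similarity metric, and eviction policy are described in the paper itself.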


Source: https://arxiv.org/html/2502.05370v1


By mcgrof