AI: post transformers

fMoE: Fine-Grained Expert Offloading for MoE Serving



This February 2025 paper introduces fMoE, a fine-grained expert offloading system designed to improve the serving efficiency of Mixture-of-Experts (MoE) Large Language Models (LLMs). The paper highlights the memory inefficiency of MoE-based LLMs during inference, where inactive experts still occupy GPU memory, and the limitations of existing coarse-grained offloading solutions, which struggle with the latency-memory trade-off. fMoE addresses these challenges by tracking iteration-level expert probability distributions through "expert maps" and by leveraging input semantic embeddings to guide expert prefetching, caching, and offloading decisions. Experiments show that fMoE significantly reduces inference latency and improves expert hit rates compared to state-of-the-art offloading methods.
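To make the core idea concrete, here is a minimal, illustrative Python sketch of expert-map-guided prefetching under a fixed GPU budget. It is not the paper's implementation: the class name ExpertMapCache, its methods, and the cosine-similarity lookup over past input embeddings are assumptions used only to show how an iteration-level expert map could steer which experts are loaded into, or evicted from, GPU memory.

```python
import numpy as np


class ExpertMapCache:
    """Hypothetical sketch: a per-layer "expert map" records how likely each
    expert is to activate for inputs with a given semantic embedding. At
    inference time, the stored map most similar to the current input's
    embedding guides which experts to prefetch into limited GPU memory."""

    def __init__(self, num_layers, num_experts, gpu_slots_per_layer):
        self.num_layers = num_layers
        self.num_experts = num_experts
        self.gpu_slots = gpu_slots_per_layer
        # Observed (input embedding, per-layer expert-probability matrix) pairs.
        self.history = []
        # Experts currently resident in GPU memory, per layer.
        self.resident = [set() for _ in range(num_layers)]

    def record(self, embedding, expert_probs):
        """Store an observed iteration-level expert probability map."""
        self.history.append((np.asarray(embedding), np.asarray(expert_probs)))

    def _closest_map(self, embedding):
        """Return the stored expert map whose input embedding has the highest
        cosine similarity to the current input's embedding."""
        e = np.asarray(embedding)
        best, best_sim = None, -np.inf
        for emb, probs in self.history:
            sim = float(e @ emb) / (np.linalg.norm(e) * np.linalg.norm(emb) + 1e-8)
            if sim > best_sim:
                best, best_sim = probs, sim
        return best

    def plan_prefetch(self, embedding):
        """Per layer, select the experts most likely to fire, prefetch any
        that are not already resident, and implicitly evict the rest."""
        probs = self._closest_map(embedding)
        if probs is None:  # No history yet: keep the current resident set.
            return [set() for _ in range(self.num_layers)]
        to_load = []
        for layer in range(self.num_layers):
            wanted = set(np.argsort(probs[layer])[::-1][: self.gpu_slots])
            to_load.append(wanted - self.resident[layer])
            self.resident[layer] = wanted
        return to_load
```

In this sketch, `record` would be called after each decoding iteration with the router's observed expert probabilities, and `plan_prefetch` before the next iteration to decide which expert weights to copy to the GPU; the real system's tracking granularity, similarity metric, and eviction policy are described in the paper itself.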


Source: https://arxiv.org/html/2502.05370v1


By mcgrof