SiftMoE: Similarity-Aware Energy-Efficient Expert Selection for Wireless Distributed MoE Inference


Mixture-of-Experts (MoE) architectures leverage sparse activation to enhance the scalability of large language models (LLMs), making them suitable for deployment in resource-constrained edge networks. However, the sheer number of experts often exceeds the memory capacity of individual edge nodes, necessitating wireless distributed MoE (WIDE) inference where experts are spread across multiple edge nodes. In this context, expert selection directly affects communication costs. Motivated by the similarity of experts, we propose SiftMoE, which judiciously selects or skips experts to strike a tradeoff between communication costs and inference accuracy. Specifically, we first establish theoretical bounds on the accuracy degradation resulting from expert replacement or skipping. Based on the bounds, we formulate an energy minimization problem for expert selection in WIDE inference subject to latency and accuracy constraints. In particular, for slow-fading channels, we derive optimal expert selection policies for both single-token decoding and multi-token prefilling. For fast-fading channels, we further extend our scheme to cope with rapidly varying channel conditions. Simulation results demonstrate that SiftMoE significantly reduces energy consumption while maintaining inference accuracy compared with conventional Top-K routing in WIDE systems.


💡 Research Summary

The paper introduces SiftMoE, a similarity‑aware, energy‑efficient expert selection framework for wireless distributed Mixture‑of‑Experts (MoE) inference (WIDE). MoE models achieve scalability by activating only a few experts per token, but the sheer number of experts often exceeds the memory of a single edge device, prompting the distribution of experts across multiple helpers. In such a setting, the choice of experts directly determines which wireless links are used, and therefore the communication energy and latency. Conventional Top‑K routing selects the K highest‑scoring experts solely based on gating scores, ignoring channel conditions; this can lead to costly transmissions over poor links.
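For concreteness, the conventional Top-K routing baseline can be sketched as follows (a generic illustration with names of our own choosing, not the paper's code): softmax the gating logits and keep the K largest, with no regard for channel quality.

```python
import numpy as np

def top_k_route(logits, k=2):
    """Conventional Top-K routing: pick the K highest-scoring experts
    from the gating logits, ignoring channel conditions entirely."""
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax gating weights
    chosen = np.argsort(weights)[-k:][::-1]  # indices of the top-K experts
    return chosen, weights[chosen]

logits = np.array([1.2, 0.3, 2.1, -0.5])
experts, gates = top_k_route(logits, k=2)   # experts 2 and 0 win
```

Under this rule a high-scoring expert is always contacted, even when it sits behind the worst wireless link, which is exactly the inefficiency SiftMoE targets.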

SiftMoE leverages two observations: (1) experts with low gating weights contribute little to the final output, and (2) many experts within the same layer are functionally similar because of load‑balancing regularization during training. The authors first develop a theoretical error analysis that quantifies how replacing or skipping an expert perturbs the layer output. They prove that the error is bounded by the product of the gating weight and a similarity metric (δ) between the original and replacement expert. When δ is small, replacement yields a tighter bound than outright skipping; when δ is large, skipping is safer. This analysis yields per‑layer error budgets εℓ that sum to a target overall deviation ε.
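The replace-versus-skip decision implied by these bounds can be illustrated with a toy rule. The bound forms below, g·δ for replacement and g·c for skipping (with c standing in for the skipped expert's output norm), are simplified stand-ins for the paper's exact expressions:

```python
def replace_or_skip(gate_weight, delta, skip_error_scale):
    """Toy decision rule mirroring the bound structure: replacing expert i
    with a similar expert j incurs error ~ g_i * delta_ij, while skipping it
    incurs error ~ g_i * skip_error_scale.  Choose whichever bound is smaller."""
    err_replace = gate_weight * delta
    err_skip = gate_weight * skip_error_scale
    if err_replace < err_skip:
        return "replace", err_replace
    return "skip", err_skip

# Small delta (highly similar experts): replacement gives the tighter bound.
action, err = replace_or_skip(gate_weight=0.3, delta=0.1, skip_error_scale=1.0)
```

With δ = 0.1 the replacement bound is 0.03, well under the skip bound of 0.3, matching the intuition that similar experts are safe substitutes.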

With these error budgets, the paper formulates an energy‑minimization problem subject to latency (Tmax) and accuracy (ε) constraints. The system model includes uplink/downlink transmission rates derived from Shannon capacity, expert loading latency, and per‑node compute capability. For slow‑fading channels (channel state constant across all layers of a token), the problem reduces to selecting a subset of experts and allocating transmission bits to each helper. By applying Lagrangian relaxation and KKT conditions, the authors derive a closed‑form policy: select experts in descending order of a metric that combines gating weight and channel quality (essentially a weight‑over‑path‑loss ratio). The policy is applied separately to (i) single‑token decoding, where each layer is optimized independently, and (ii) multi‑token prefilling, where a joint optimization across tokens is performed to exploit parallelism.
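A greedy sketch of such a selection rule is shown below. The metric gate/path-loss and the skip-error bound proportional to the gating weight are deliberate simplifications of the paper's closed-form policy, used only to convey its shape:

```python
def select_experts(gates, path_losses, eps_budget, skip_cost=1.0):
    """Greedy sketch of the closed-form policy: rank experts by gating weight
    over path loss, then keep selecting until the accumulated error bound of
    the still-skipped experts fits within the per-layer budget eps_budget."""
    order = sorted(range(len(gates)),
                   key=lambda i: gates[i] / path_losses[i], reverse=True)
    selected = []
    skipped_err = sum(g * skip_cost for g in gates)  # bound if all skipped
    for i in order:
        if skipped_err <= eps_budget:
            break
        selected.append(i)
        skipped_err -= gates[i] * skip_cost  # expert i is no longer skipped
    return selected

# Expert 2 has a modest gate but an excellent channel, so it outranks expert 1.
chosen = select_experts(gates=[0.5, 0.3, 0.15, 0.05],
                        path_losses=[1.0, 2.0, 0.5, 1.0],
                        eps_budget=0.2)
```

Note how the ranking blends model-side importance (the gate) with channel-side cost (the path loss), which is the core of the derived policy.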

In fast‑fading environments, where channel realizations vary per layer, the authors replace the stochastic objective with a deterministic surrogate based on expected channel gains. This surrogate restores the same structure as the slow‑fading problem, allowing the same expert‑selection rule to be used. The remaining stochastic component—how many bits to transmit in each time slot—is tackled with dynamic programming (DP). The DP state tracks accumulated energy and latency; the transition cost reflects the additional energy needed to send extra bits given the instantaneous channel. The optimal path yields the per‑layer bit allocation that minimizes expected energy while respecting latency and error constraints.
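The DP can be sketched in miniature as follows, assuming a toy Shannon-style energy cost of (2^b − 1)/h for sending b bit-units in a slot with gain h and a hard deadline of len(gains) slots; the state and cost model are simplified stand-ins for the paper's formulation:

```python
from functools import lru_cache

def min_energy_schedule(total_bits, gains):
    """DP sketch of per-slot bit allocation under fast fading.
    State = (slot index, bits still to send); transition cost is the toy
    energy (2**b - 1)/h for sending b bit-units over a slot with gain h.
    Returns the minimum total transmit energy meeting the deadline."""
    n = len(gains)

    @lru_cache(maxsize=None)
    def dp(t, remaining):
        if remaining == 0:
            return 0.0
        if t == n:
            return float("inf")  # deadline reached with bits left: infeasible
        best = float("inf")
        for b in range(remaining + 1):
            cost = (2.0 ** b - 1.0) / gains[t]
            best = min(best, cost + dp(t + 1, remaining - b))
        return best

    return dp(0, total_bits)

# Exponential energy cost favors pushing more bits into the better slot.
energy = min_energy_schedule(total_bits=4, gains=[1.0, 4.0])
```

Here the optimum sends 1 bit-unit over the weak slot and 3 over the strong one (energy 1 + 7/4 = 2.75), rather than splitting evenly, illustrating why channel-aware bit allocation beats a static schedule.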

Experimental evaluation uses state‑of‑the‑art MoE models (e.g., SwitchTransformer with 64 experts per layer, GLaM with 128 experts) on standard language datasets (WMT, WikiText). Compared against baseline Top‑K routing and a naive skipping scheme, SiftMoE achieves 30‑45 % reduction in communication energy while keeping final token‑level accuracy within 0.2 % of the baseline. The latency increase is modest (≈5‑10 %) because the saved transmission time outweighs any extra loading overhead. In fast‑fading simulations, the DP‑based bit allocation keeps the expected energy within 2 % of the theoretical optimum and respects the same accuracy and latency budgets.

Overall, SiftMoE provides the first rigorous, theoretically grounded framework that jointly considers expert functional similarity and wireless channel conditions for MoE inference at the edge. It demonstrates that careful expert replacement or skipping, guided by provable error bounds, can dramatically cut energy consumption without sacrificing model performance. The paper suggests future extensions to multi‑user scenarios, asynchronous expert loading, real‑world wireless testbeds, and the integration of learning‑based predictors for similarity metrics.

