Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model’s architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-efficient MoE LLMs. Code is available at https://github.com/szdtzpj/Breaking_the_moe_trilemma
💡 Research Summary
Mixture‑of‑Experts (MoE) large language models have become a popular way to scale model capacity without a proportional increase in compute, but in practice they suffer from a “trilemma”: load imbalance across experts, massive parameter redundancy due to replicating expert weights, and heavy all‑to‑all communication when routing tokens between devices. Existing work typically tackles one of these issues in isolation, which often worsens the others. This paper proposes a unified framework that simultaneously addresses all three challenges through dynamic expert clustering, structured intra‑group compression, hierarchical routing, and heterogeneous precision with dynamic offloading.
Dynamic dual‑similarity clustering
Each expert is represented by (i) a flattened weight vector and (ii) an activation centroid maintained as an exponential moving average of the token embeddings routed to that expert. Cosine similarity is computed separately on the weight vectors (S_param) and on the centroids (S_task). A fused similarity score S = α·S_param + (1‑α)·S_task (α∈[0,1]) then drives the periodic regrouping of experts into clusters.
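The fused-similarity computation above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the EMA momentum, and the default α are assumptions for the sake of the example.

```python
import numpy as np

def cosine_sim_matrix(X):
    # Pairwise cosine similarity between the rows of X.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def fused_similarity(weights, centroids, alpha=0.5):
    """Fused expert similarity S = alpha * S_param + (1 - alpha) * S_task.

    weights:   (E, D) flattened expert weight vectors
    centroids: (E, H) EMA activation centroids, one per expert
    alpha:     trade-off between parameter and activation similarity, in [0, 1]
    """
    s_param = cosine_sim_matrix(weights)    # similarity of expert parameters
    s_task = cosine_sim_matrix(centroids)   # similarity of routed-token statistics
    return alpha * s_param + (1.0 - alpha) * s_task

def update_centroid(centroid, token_embs, momentum=0.99):
    # EMA update of one expert's activation centroid with the mean
    # embedding of the tokens routed to it in the current step.
    # (momentum is a hypothetical value; the paper does not fix it here.)
    return momentum * centroid + (1.0 - momentum) * token_embs.mean(axis=0)
```

The resulting E×E matrix S is symmetric with a unit diagonal, so any off-the-shelf clustering method that accepts a precomputed similarity (e.g. agglomerative clustering) can consume it directly for the periodic regrouping step.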