Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ of FLOPs per token spent on the expert layers to those spent on the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^{*}$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^{*}$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.


💡 Research Summary

The paper introduces a novel scaling law that explicitly accounts for the allocation of compute between expert (feed‑forward) and attention sub‑layers in Mixture‑of‑Experts (MoE) Transformers. While MoE architectures allow massive parameter growth with near‑constant per‑token FLOPs, they also expose a new design knob: the FLOPs ratio $r = C_E / C_A$, where $C_E$ and $C_A$ are the FLOPs spent on expert and attention layers respectively. The authors also consider sparsity $S$, defined as the fraction of inactive experts, because the effectiveness of expert compute depends strongly on how many experts are active per token.
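To make these definitions concrete, here is a minimal sketch that computes $C_A$, $C_E$, the ratio $r$, and the sparsity $S$ for a single MoE Transformer layer. The function name, layer dimensions, and the dense-FLOP approximations (standard $2 \times$ multiply-accumulate counting) are assumptions for illustration, not taken from the paper:

```python
def moe_flops_per_token(d_model, d_ff, n_active_experts, n_total_experts, seq_len):
    """Rough per-token FLOP estimates for one MoE Transformer layer.

    Uses common dense-FLOP approximations (assumed, not from the paper):
    attention ~ 4*d^2 for the Q/K/V/O projections plus 2*seq_len*d for
    score and value mixing; each active expert ~ 4*d*d_ff for its
    up- and down-projections.
    """
    c_attn = 4 * d_model**2 + 2 * seq_len * d_model      # C_A
    c_exp = n_active_experts * 4 * d_model * d_ff        # C_E
    r = c_exp / c_attn                                   # expert-attention FLOPs ratio
    s = 1 - n_active_experts / n_total_experts           # sparsity: fraction inactive
    return c_attn, c_exp, r, s
```

For example, a layer with `d_model=1024`, `d_ff=4096`, 2 of 16 experts active, and a 2048-token context gives $r = 4$ and $S = 0.875$ under these approximations.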

Theoretical motivation
The paper argues that additional compute allocated to either component yields diminishing returns, but the rate of diminishing returns for experts is sparsity‑dependent. Low sparsity (many active experts) spreads extra expert FLOPs across diverse subnetworks, giving higher marginal gains; high sparsity concentrates compute on a few experts, leading to quicker saturation. Attention compute, by contrast, is largely insensitive to sparsity. Consequently, the optimal ratio $r^{*}$ must be a function of total training compute $C$ and sparsity $S$, and the authors propose a minimal functional form $r^{*}(C,S)=\alpha(S)\, C^{\beta(S)}$.

Empirical methodology
To test this hypothesis, the authors conduct controlled sweeps over $r$ while keeping per‑token compute fixed, across multiple model scales (from ~30 M to >500 M parameters) and three sparsity regimes (≈ 82 %, 90 %, 95 %+ inactive experts). For each configuration they train models and record the final loss. The loss surface consistently shows a clear valley along the $r$ axis, confirming a well‑defined optimum $r^{*}$. As $C$ increases, the location of the valley shifts monotonically toward larger $r$, indicating that allocating a larger fraction of compute to experts becomes increasingly beneficial at scale. The shift is steeper for low‑sparsity models and more gradual for high‑sparsity ones.
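The valley-locating step in such a sweep can be sketched as follows. A common trick (the summary does not state the paper's exact fitting procedure) is to fit a parabola to loss versus $\log r$ and read off $r^{*}$ at the vertex:

```python
import numpy as np

def locate_optimum(ratios, losses):
    """Estimate r* from a sweep over r at fixed per-token compute:
    fit loss ~ a*(log r)^2 + b*(log r) + c and return the ratio at
    the parabola's vertex. A generic valley-fitting heuristic, not
    necessarily the paper's procedure."""
    x = np.log(ratios)
    a, b, c = np.polyfit(x, losses, 2)
    return float(np.exp(-b / (2 * a)))   # vertex at log r = -b / (2a)

# Synthetic sweep with a known minimum at r = 3 (illustrative data only).
rs = np.array([1.0, 2.0, 3.0, 4.5, 6.0])
ls = (np.log(rs) - np.log(3.0)) ** 2 + 2.1
```

On this synthetic valley, `locate_optimum(rs, ls)` recovers $r^{*} = 3$; on real sweep data the fit would be restricted to points near the minimum.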

Scaling law discovery
Plotting $r^{*}$ against $C$ on log–log axes yields an approximately linear relationship for each sparsity level. Fitting a power‑law $r^{*}= \alpha_r(S)\, C^{\beta_r(S)}$ provides an excellent fit. Moreover, the fitted coefficients themselves follow simple power‑law dependencies on the fraction of active experts $(1-S)$:

  • $\alpha_r = 6.7 \times 10^{-5} (1-S)^{-1.23}$
  • $\beta_r = 0.24 (1-S)^{0.21}$

Thus, as sparsity increases, the baseline coefficient $\alpha_r$ grows while the exponent $\beta_r$ shrinks, capturing the observed slower growth of $r^{*}$ with compute for highly sparse models.
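Plugging the fitted coefficients into the power law gives a direct recipe for $r^{*}$. A small sketch (the function name is ours; $C$ is total training FLOPs, $S$ the fraction of inactive experts, both as quoted above):

```python
def optimal_ratio(C, S):
    """Optimal expert-attention FLOPs ratio r*(C, S) using the fitted
    coefficients quoted in the summary:
        alpha_r = 6.7e-5 * (1 - S)^-1.23
        beta_r  = 0.24   * (1 - S)^0.21
    """
    alpha_r = 6.7e-5 * (1 - S) ** -1.23
    beta_r = 0.24 * (1 - S) ** 0.21
    return alpha_r * C ** beta_r
```

Consistent with the sweeps, `optimal_ratio` increases with $C$ at any fixed sparsity, and the growth with compute is steeper at $S \approx 0.82$ than at $S \approx 0.95$ because $\beta_r$ shrinks as $(1-S)$ does.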

Integration with loss scaling
The authors extend the classic Chinchilla loss scaling $L = a N^{\alpha} + b D^{\beta}$ (where $N$ is parameter count and $D$ is token count) by adding two penalty terms that quantify inefficiency due to sub‑optimal internal allocation:

  1. A term $d \frac{r}{r^{*}+1}$ that penalizes deviation of the actual FLOPs ratio $r$ from its optimal value $r^{*}$.
  2. A term $c\, e^{R(1-S)^{\gamma}} N^{\lambda}$ that captures the extra cost of allocating too much compute to experts when sparsity is high (here $R$ is total FLOPs).

The full extended scaling law is:

$$L = a N^{\alpha} + b D^{\beta} + d \frac{r}{r^{*}+1} + c\, e^{R(1-S)^{\gamma}} N^{\lambda}$$
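Combining the Chinchilla terms with the two penalty terms enumerated above gives a simple loss evaluator. This is a sketch only: all coefficient values below are illustrative placeholders, not the paper's fitted constants, and a small $R$ is used because the exponential term grows very quickly for realistic FLOP budgets:

```python
import math

def extended_loss(N, D, r, r_star, S, R,
                  a=406.4, b=410.7, alpha=-0.34, beta=-0.28,
                  c=1e-3, d=1e-2, gamma=1.0, lam=0.0):
    """Sketch of the extended scaling law: Chinchilla terms plus the
    two allocation penalties described above. Coefficients are
    placeholders chosen for illustration, not fitted values."""
    chinchilla = a * N ** alpha + b * D ** beta          # classic size/data terms
    ratio_penalty = d * r / (r_star + 1)                 # sub-optimal r penalty
    sparsity_penalty = c * math.exp(R * (1 - S) ** gamma) * N ** lam
    return chinchilla + ratio_penalty + sparsity_penalty
```

With these placeholder coefficients the loss still behaves as expected qualitatively: it falls as $N$ or $D$ grows and rises as $r$ drifts away from $r^{*}$, which is the behavior the extended law is meant to encode.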

