Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply an orthogonality loss to enforce expert diversity and find that it fails on multiple fronts: it does not reduce weight-space overlap (MSO actually increases by up to 114%), activation-space overlap remains high (∼0.6) regardless of regularization strength, and effects on performance are inconsistent, with a marginal improvement on WikiText-103 (−0.9%), a slight degradation on TinyStories (+0.9%), and highly variable results on PTB (std > 1.0). Our analysis across 7 regularization strengths reveals no significant correlation (r = −0.293, p = 0.523) between weight and activation orthogonality. These findings demonstrate that weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for promoting MoE expert diversity.
Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of parameters per input (Shazeer et al., 2017; Fedus et al., 2022). A common assumption is that expert representations should be orthogonal to minimize interference (Chen et al., 2022, 2023). This intuition stems from linear algebra: orthogonal vectors are maximally distinguishable, and their outputs do not interfere when combined.
Hypothesis. Orthogonality regularization should improve expert diversity and reduce perplexity.
Finding. It does not, and it is unreliable. Across three datasets (TinyStories, WikiText-103, PTB), geometric regularization yields inconsistent results: a marginal improvement on WikiText-103 (−0.9%), a slight degradation on TinyStories (+0.9%), and high variance on PTB (std > 1.0).
We identify a Weight-Activation Gap: weight-space orthogonality (MSO ≈ 10⁻⁴) does not translate to activation-space orthogonality (MSO ≈ 0.6). Across 7 regularization strengths, we find no significant correlation between weight and activation overlap (r = −0.293, p = 0.523), indicating that weight geometry and functional orthogonality are largely independent.
We show that orthogonality regularization fails to reduce weight MSO (it actually increases it by up to 114%) and yields inconsistent effects on loss: a marginal improvement on WikiText-103 (−0.9%), a slight degradation on TinyStories (+0.9%), and high variance on PTB (std > 1.0).
We identify a weight-activation disconnect: activation overlap is ∼1000× higher than weight overlap, with no significant correlation (r = -0.293, p = 0.523, n=7).
We demonstrate that weight-space regularization is an unreliable optimization target: it neither achieves its geometric goal nor reliably improves performance.
Orthogonality Loss. For expert weight matrices $\{W_i\}_{i=1}^{N}$, we define

$$\mathcal{L}_{\text{orth}} = \sum_{i<j} \langle \hat{W}_i, \hat{W}_j \rangle^2,$$

where $\hat{W}_i = \mathrm{vec}(W_i)/\lVert \mathrm{vec}(W_i) \rVert$ is the normalized flattened weight vector. This loss encourages orthogonality among expert representations and is added to the language modeling objective with weight $\lambda$.
Mean Squared Overlap (MSO). We measure geometric diversity using

$$\mathrm{MSO} = \frac{2}{N(N-1)} \sum_{i<j} \langle \hat{W}_i, \hat{W}_j \rangle^2.$$

Lower MSO indicates more orthogonal (diverse) experts. We compute MSO for both weights and activations.
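To make the computation concrete, the sketch below implements the pairwise squared-overlap quantity in PyTorch; the same routine can serve as the (differentiable) orthogonality regularizer and, detached, as the weight-space MSO diagnostic. The function name and argument layout are ours for illustration, not the original training code.

```python
import torch

def pairwise_squared_overlap(weights):
    """Mean squared overlap among expert weight matrices.

    weights: list of N tensors, one per expert (e.g. the up-projection
    matrices of shape (d_ffn, d_model)). Each matrix is flattened and
    L2-normalized before computing pairwise inner products; the result is
    the mean of the squared off-diagonal overlaps.
    """
    flat = torch.stack([w.reshape(-1) for w in weights])    # (N, d) flattened experts
    flat = flat / flat.norm(dim=1, keepdim=True)            # unit-norm rows
    gram = flat @ flat.t()                                   # (N, N) cosine overlaps
    n = gram.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool, device=gram.device)
    return (gram[mask] ** 2).mean()                          # symmetric, so equals the mean over i < j
```

With λ = 0 this quantity is only logged as the weight MSO; with λ > 0 a proportional term is added to the language-modeling loss.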
Activation MSO. For co-activated experts producing outputs $\{h_i\}$, we compute

$$\mathrm{MSO}_{\text{act}} = \mathbb{E}_{x}\!\left[ \frac{1}{|S(x)|(|S(x)|-1)} \sum_{\substack{i,j \in S(x) \\ i \neq j}} \langle \hat{h}_i(x), \hat{h}_j(x) \rangle^2 \right],$$

where $\hat{h}_i(x) = h_i(x)/\lVert h_i(x) \rVert$ and $S(x)$ is the set of $k$ selected experts for input $x$. This measures functional similarity between expert outputs on actual inputs.
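A corresponding sketch for the activation side, under our reading of the implementation details (top-k expert outputs after selection, unweighted by gating scores); tensor shapes and names are illustrative.

```python
import torch

def activation_mso(expert_outputs):
    """Activation MSO for a batch of tokens.

    expert_outputs: tensor of shape (T, k, d) holding, for each of T tokens,
    the outputs h_i of its k = |S(x)| selected experts (unweighted by gating
    scores). Returns the squared cosine overlap between co-activated expert
    outputs, averaged over expert pairs and tokens.
    """
    h = expert_outputs / expert_outputs.norm(dim=-1, keepdim=True)   # unit-norm outputs
    gram = torch.einsum('tid,tjd->tij', h, h)                         # (T, k, k) overlaps
    k = gram.shape[-1]
    mask = ~torch.eye(k, dtype=torch.bool, device=gram.device)
    return (gram[:, mask] ** 2).mean()                                # off-diagonal pairs only
```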
Setup. We train NanoGPT-MoE (∼130M parameters, 8 experts, 6 layers, top-2 routing) on TinyStories (Eldan and Li, 2023) for 10K iterations with AdamW (Loshchilov and Hutter, 2019) (lr = 5×10⁻⁴, β₁ = 0.9, β₂ = 0.95, weight decay = 0.1). Each MoE layer contains 8 experts with hidden dimension 512 and intermediate dimension 2048. TinyStories experiments use 5 random seeds (42, 123, 456, 789, 1337).
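For reference, the configuration and optimizer settings above can be packaged as in the sketch below; the dataclass and helper are our illustrative wrapping, not the paper's training script.

```python
from dataclasses import dataclass

import torch

@dataclass
class MoEConfig:
    # Architecture and training settings from the setup above.
    n_layer: int = 6
    n_expert: int = 8
    top_k: int = 2
    d_model: int = 512
    d_ffn: int = 2048
    max_iters: int = 10_000
    seeds: tuple = (42, 123, 456, 789, 1337)

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW hyperparameters as reported: lr = 5e-4, betas = (0.9, 0.95), weight decay = 0.1.
    return torch.optim.AdamW(
        model.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.1
    )
```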
Implementation Details. We regularize the up-projection weights ($W_{\mathrm{up}} \in \mathbb{R}^{d_{\mathrm{ffn}} \times d_{\mathrm{model}}}$) of each expert. Each weight matrix is flattened and L2-normalized before computing pairwise inner products. The λ sweep uses 7 values: {0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2}. MSO is computed per layer and averaged across all 6 MoE layers. Activation MSO is computed on the post-gating expert outputs for the top-2 selected experts, unweighted by gating scores. We do not use an auxiliary load-balancing loss.
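The sketch below shows how the regularizer enters the training objective under this setup, reusing pairwise_squared_overlap from above; the per-layer attribute access (layer.experts[i].w_up) is a placeholder for the actual module structure, and averaging the regularizer over layers is our assumption.

```python
LAMBDAS = [0.0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]   # the 7-value sweep

def total_loss(lm_loss, moe_layers, lam):
    """lm_loss: scalar language-modeling loss.
    moe_layers: the MoE layers; each exposes its experts' up-projection
    weights (the access pattern below is a placeholder, not the real API).
    lam: regularization strength lambda from the sweep above.
    """
    reg = 0.0
    for layer in moe_layers:
        w_ups = [expert.w_up.weight for expert in layer.experts]   # W_up per expert
        reg = reg + pairwise_squared_overlap(w_ups)
    reg = reg / len(moe_layers)          # average over the 6 MoE layers (our assumption)
    return lm_loss + lam * reg
```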
Despite the explicit regularization objective, perplexity improvements are not statistically significant (Table 2).
Why Are p-values High? The high p-value (p = 0.727) reflects both a minimal effect size and increased variance. The baseline shows a low std (0.08), while λ = 0.001 raises it to 0.32, a 4× increase that destabilizes training. The slight PPL increase (+0.9%) is dwarfed by this variance, indicating that the regularization adds noise without benefit.
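The seed-level comparison behind such a p-value can be sketched as follows; we assume a two-sample Welch's t-test over the five per-seed perplexities, which may differ from the exact test used.

```python
from scipy import stats

def ppl_significance(ppl_baseline, ppl_regularized):
    """Welch's t-test over per-seed validation perplexities (one value per seed)."""
    t_stat, p_value = stats.ttest_ind(ppl_baseline, ppl_regularized, equal_var=False)
    return t_stat, p_value
```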
Table 3 reveals the core finding: weight and activation geometry are fundamentally decoupled.
Correlation Analysis. Figure 1 visualizes the disconnect: as λ increases, weight MSO rises (the regularization clearly alters weight geometry, albeit in the wrong direction), but activation MSO remains flat at ∼0.57. Across the 7 regularization strengths, we find Pearson r = −0.293 (p = 0.523, 95% CI: [−0.857, 0.590]), which is not statistically significant. This is consistent with weight and activation geometry being largely independent.
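The reported correlation and interval can be computed as sketched below; we assume the 95% CI comes from a Fisher z-transform of the Pearson r over the 7 (weight MSO, activation MSO) pairs, which is consistent with the interval quoted above. Function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

def pearson_with_ci(weight_mso, activation_mso, alpha=0.05):
    """Pearson r, two-sided p-value, and a Fisher-z confidence interval."""
    r, p = stats.pearsonr(weight_mso, activation_mso)
    n = len(weight_mso)
    z = np.arctanh(r)                         # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)                 # standard error in z-space
    z_crit = stats.norm.ppf(1.0 - alpha / 2)
    ci = (np.tanh(z - z_crit * se), np.tanh(z + z_crit * se))
    return r, p, ci
```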
To test whether our findings generalize beyond TinyStories, we evaluate orthogonality regularization on WikiText-103 (Merity et al., 2016) and Penn Treebank (PTB). On WikiText-103, we observe a small, consistent improvement (−0.9%). However, on PTB (1.2M tokens), results are highly variable across seeds (std ∼ 1.0), making conclusions unreliable. This high variance suggests that the effectiveness of geometric regularization may depend on dataset-seed interactions rather than dataset characteristics alone.
Interpretation. The high variance on PTB (std ∼1.0) compared to WikiText-103 (std ∼0.05) may reflect dataset-scale effects. Smaller datasets may exhibit more seed-dependent expert specialization patterns, leading to unstable outcomes. Regardless of direction, the inconsistency itself underscores that weight-space orthogonality regularization is not a reliable lever for improving MoE performance.