Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory
Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent analogous to how Adam and Signum reduce to sign descent, their general relationship and relative data efficiency under controlled settings remain unclear. Through extensive experiments on language models, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam’s advantage over Signum. We show that Shampoo’s update applied to weight matrices can be decomposed into an adapted Muon update. Consistent with this, Shampoo’s benefits can be exclusively attributed to its application to weight matrices, challenging interpretations agnostic to parameter shapes. This admits a new perspective that also avoids shortcomings of related interpretations based on variance adaptation and whitening: rather than enforcing semi-orthogonality as in spectral descent, Shampoo’s updates are time-averaged semi-orthogonal in expectation.
💡 Research Summary
The paper investigates the relationship between two matrix‑aware optimizers—Shampoo and Muon—and their element‑wise counterparts, Adam and Signum. While Adam can be viewed as an element‑wise scaled version of Signum, the authors demonstrate that Shampoo can be decomposed into a Muon‑like matrix sign update multiplied by left‑ and right‑hand adaptation matrices. This structural analogy explains why Shampoo consistently outperforms Muon in token‑efficiency, mirroring the advantage of Adam over Signum.
The authors first revisit the Adam‑Signum connection, showing that Adam’s update can be written as “adaptation ⊙ sign”, where the adaptation term (derived from the EMA of squared gradients) scales the sign of the EMA of the gradient. They argue that this adaptation is responsible for Adam’s superior performance.
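The “adaptation ⊙ sign” factorization is a simple algebraic identity, since m/(√v + ε) = (|m|/(√v + ε)) · sign(m). A minimal NumPy sketch (a single zero-initialized step with arbitrary hyper-parameters and no bias correction, for illustration only, not the paper’s code):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 4))           # a stochastic gradient
beta1, beta2, eps = 0.9, 0.999, 1e-8

# EMAs of the gradient (first moment) and squared gradient (second moment),
# zero-initialized, single step
m = (1 - beta1) * g
v = (1 - beta2) * g**2

adam_update = m / (np.sqrt(v) + eps)

# Decomposition: Adam = element-wise adaptation ⊙ sign of the gradient EMA
adaptation = np.abs(m) / (np.sqrt(v) + eps)
signum_update = np.sign(m)

assert np.allclose(adam_update, adaptation * signum_update)
```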
Shampoo is introduced as a Kronecker‑factored preconditioner for full‑matrix Adagrad. Its update involves two factor matrices Lₜ and Rₜ, which are EMAs of GₜGₜᵀ and GₜᵀGₜ respectively, each raised to a negative power −p and applied to the left and right of the gradient. When the EMA decay β₂ and the regularizer ε vanish, Shampoo with p = ¼ (or p = ½ in the one‑sided variant) reduces exactly to spectral descent: the update becomes the polar factor (UₜVₜᵀ) of the gradient’s SVD, i.e., the matrix analogue of the sign function.
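This reduction can be checked numerically. A hedged NumPy sketch, assuming the two-sided p = ¼ update on a single gradient (no EMA, ε = 0) and using pseudo-inverse matrix powers for the rank-deficient factors:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 4))  # a single deterministic gradient matrix

# One-step factor matrices with no EMA accumulation and eps = 0
L = G @ G.T
R = G.T @ G

def pinv_quarter(A, tol=1e-10):
    """Pseudo-inverse fourth root A^{-1/4} of a symmetric PSD matrix."""
    w, Q = np.linalg.eigh(A)
    w = np.where(w > tol, w, np.inf)  # drop null directions (pseudo-inverse)
    return Q @ np.diag(w ** -0.25) @ Q.T

shampoo_update = pinv_quarter(L) @ G @ pinv_quarter(R)

# Spectral-descent target: the polar factor of G (its "matrix sign")
U, s, Vt = np.linalg.svd(G, full_matrices=False)
polar = U @ Vt

assert np.allclose(shampoo_update, polar, atol=1e-8)
# The polar factor is semi-orthogonal: its singular values are all 1
assert np.allclose(polar.T @ polar, np.eye(4), atol=1e-10)
```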
Crucially, the paper derives a decomposition (Equation 10) that rewrites the Shampoo step as:
(Lₜ)^{‑p} Mₜ (Rₜ)^{‑p} = (Lₜ)^{‑p}(MₜMₜᵀ)^{¼} · UₜVₜᵀ · (MₜᵀMₜ)^{¼}(Rₜ)^{‑p},
where Mₜ is the EMA of the gradient and UₜVₜᵀ is the Muon matrix sign. The left and right factors play the same role as Adam’s element‑wise variance adaptation, providing a scaling that mitigates stochastic noise.
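The decomposition is an identity for any Mₜ, since (MₜMₜᵀ)^{¼} UₜVₜᵀ (MₜᵀMₜ)^{¼} = Mₜ. A small NumPy check with EMA-accumulated factors (hyper-parameters and dimensions are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
beta1, beta2, p = 0.9, 0.99, 0.25

def mpow(A, power):
    # Matrix power of a symmetric positive-definite matrix via eigendecomposition
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.clip(w, 1e-12, None) ** power) @ Q.T

M = np.zeros((5, 5)); L = np.zeros((5, 5)); R = np.zeros((5, 5))
for _ in range(10):                      # accumulate EMAs over noisy gradients
    G = rng.normal(size=(5, 5))
    M = beta1 * M + (1 - beta1) * G
    L = beta2 * L + (1 - beta2) * G @ G.T
    R = beta2 * R + (1 - beta2) * G.T @ G

shampoo = mpow(L, -p) @ M @ mpow(R, -p)

U, s, Vt = np.linalg.svd(M)
muon = U @ Vt                            # matrix sign / polar factor of M

left  = mpow(L, -p) @ mpow(M @ M.T, 0.25)    # left adaptation factor
right = mpow(M.T @ M, 0.25) @ mpow(R, -p)    # right adaptation factor

# Shampoo = left adaptation · Muon matrix sign · right adaptation
assert np.allclose(shampoo, left @ muon @ right, atol=1e-6)
```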
To test the hypothesis that these adaptations improve Muon, the authors conduct extensive language‑model experiments using Llama‑3 variants (320 M and 1.5 B parameters) trained on the C4 dataset. They evaluate token budgets corresponding to 1× and 8× the Chinchilla recommendation, batch sizes of 64 and 256, and sweep learning rates, EMA hyper‑parameters (β₁, β₂), and ε for Shampoo. All optimizers share the same PyTorch Distributed Shampoo codebase, differ only in their preconditioner or matrix‑sign operation, and use Adam‑based grafting for magnitude control.
Results (Table 1) show that every Shampoo variant (both ¼‑ and ½‑power versions, as well as KL‑Shampoo) matches or surpasses Muon with SVD in validation perplexity across all settings. The advantage ranges from roughly 5 % to 30 % fewer tokens needed to reach a given perplexity. Shampoo’s benefit is more pronounced at larger batch sizes, suggesting a higher critical batch size for matrix‑based methods, consistent with prior observations. Scaling Muon with classic or “Moonlight” layer‑wise factors does not beat grafting (Table 3). Using the Newton‑Schulz iteration for Muon and the default ε for Shampoo yields comparable numerical stability, confirming that the observed performance gap is algorithmic rather than numerical (Table 4).
The authors also discuss KL‑Shampoo, which minimizes the KL‑divergence between a Gaussian model of the gradient and a Kronecker‑structured approximation. While KL‑Shampoo converges faster in some regimes, it underperforms Shampoo ½ at smaller batch sizes, indicating sensitivity to batch‑size and perhaps to the quality of the KL approximation.
Finally, the paper proposes a new interpretation of Shampoo: rather than enforcing exact semi‑orthogonality (as spectral descent does), Shampoo’s updates are “time‑averaged semi‑orthogonal in expectation.” The EMA of the factor matrices ensures that, on average, the left and right preconditioners bring the gradient close to a semi‑orthogonal matrix, while the adaptation matrices damp stochastic variance. This viewpoint sidesteps earlier explanations based on variance adaptation or whitening and highlights that the core advantage of Shampoo stems from the combination of a matrix‑sign operation (Muon) and adaptive scaling on both sides.
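One way to see the distinction is a toy NumPy simulation (my own illustration, not from the paper): with deterministic gradients the EMA-preconditioned update is semi-orthogonal up to a scalar, while gradient noise spreads its singular values, so exact semi-orthogonality holds only on average:

```python
import numpy as np

rng = np.random.default_rng(3)
beta1, beta2 = 0.9, 0.99
Gstar = rng.normal(size=(5, 5))  # the "true" (noise-free) gradient

def mpow(A, power):
    # Matrix power of a symmetric positive-definite matrix
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.clip(w, 1e-12, None) ** power) @ Q.T

def shampoo_singular_values(noise_scale, steps=50):
    M = np.zeros((5, 5)); L = np.zeros((5, 5)); R = np.zeros((5, 5))
    for _ in range(steps):
        G = Gstar + noise_scale * rng.normal(size=(5, 5))
        M = beta1 * M + (1 - beta1) * G
        L = beta2 * L + (1 - beta2) * G @ G.T
        R = beta2 * R + (1 - beta2) * G.T @ G
    update = mpow(L, -0.25) @ M @ mpow(R, -0.25)
    return np.linalg.svd(update, compute_uv=False)

s_clean = shampoo_singular_values(0.0)   # deterministic gradients
s_noisy = shampoo_singular_values(1.0)   # stochastic gradients

# Without noise: semi-orthogonal up to a scalar (all singular values equal)
assert np.allclose(s_clean, s_clean[0], atol=1e-8)
# With noise: singular values spread, as adaptation damps noisy directions
assert s_noisy.max() - s_noisy.min() > 1e-2
```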
In summary, the work clarifies that Shampoo’s superiority over Muon is analogous to Adam’s over Signum, rooted in a two‑sided adaptive scaling of a matrix‑sign update. The extensive empirical study validates this theory across model sizes, token budgets, and batch sizes, and introduces a fresh conceptual lens—time‑averaged semi‑orthogonality—that may guide future design of matrix‑aware optimizers.