Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent
To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD’s Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ($\ell_\infty$) and stochastic spectral descent (specSGD) / Muon ($\mathcal{S}_\infty$). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160 million parameter Llama model.
💡 Research Summary
The paper addresses a critical gap in adaptive batch‑size scheduling for modern deep‑learning optimizers that operate under non‑Euclidean geometries. Existing adaptive strategies rely on the Gradient Noise Scale (GNS) derived for stochastic gradient descent (SGD) under the Euclidean ℓ₂ norm. While effective for SGD, these metrics are mismatched for optimizers such as signSGD/Signum (which use the ℓ_∞ norm) and stochastic spectral descent (specSGD) or its modern incarnation Muon (which use the Schatten‑∞ norm). The authors propose a principled extension of GNS to these non‑Euclidean settings by measuring noise in the dual norm of the optimizer’s geometry: ℓ₁ for sign‑based methods and the nuclear norm (S₁) for spectral methods.
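The dual-norm pairing (ℓ_∞ ↔ ℓ₁, S_∞ ↔ S₁) suggests how a noise metric could be estimated from the per-rank gradients already available in data-parallel training. The sketch below is illustrative, not the paper's exact estimator: it measures the spread of local mini-batch gradients around their mean in the appropriate dual norm, with all function names being hypothetical.

```python
import numpy as np

def dual_norm(g, geometry):
    """Dual norm of the optimizer's geometry:
    l1 for sign-based (l_inf) methods, nuclear (S_1) for spectral (S_inf)."""
    if geometry == "sign":
        return np.abs(g).sum()
    if geometry == "spectral":
        # Nuclear norm: sum of singular values.
        return np.linalg.svd(g, compute_uv=False).sum()
    raise ValueError(f"unknown geometry: {geometry}")

def noise_estimate(local_grads, geometry):
    """Illustrative dual-norm noise estimate from the mini-batch
    gradients computed on different data-parallel ranks: average
    dual-norm deviation of each rank's gradient from the global mean."""
    mean_g = np.mean(local_grads, axis=0)
    return float(np.mean([dual_norm(g - mean_g, geometry)
                          for g in local_grads]))
```

In a real distributed run, `mean_g` is the all-reduced gradient that the optimizer already computes, so the extra cost is only the per-rank dual-norm evaluations.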
The theoretical development begins with a generalized steepest‑descent framework: at each iteration a mini‑batch gradient gₖ is computed, and the update direction pₖ is the maximizer of ⟨gₖ, p⟩ under a chosen norm constraint ‖p‖ ≤ 1. Lemma 3.1 shows that the expected alignment between the true gradient ∇L and the stochastic direction satisfies

𝔼⟨∇L, pₖ⟩ ≥ ‖∇L‖_† − 2 𝔼‖gₖ − ∇L‖_†,

where ‖·‖_† denotes the dual of the constraint norm (ℓ₁ for signSGD, the nuclear norm S₁ for specSGD). Progress per step is therefore governed by the gradient's dual norm minus a noise term measured in that same dual norm, which is what motivates defining the gradient noise scale in non‑Euclidean geometry.
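Both geometries admit closed-form steepest-descent directions, which makes the framework concrete: the maximizer of ⟨g, p⟩ over ‖p‖_∞ ≤ 1 is sign(g), and over ‖p‖_{S_∞} ≤ 1 it is UVᵀ from the SVD of g. A minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def steepest_direction(g, geometry):
    """Direction p maximizing <g, p> subject to ||p|| <= 1 in the
    optimizer's norm; the attained value is the dual norm of g."""
    if geometry == "sign":
        # l_inf ball: coordinatewise sign, <g, p> = ||g||_1.
        return np.sign(g)
    if geometry == "spectral":
        # S_inf (spectral-norm) ball: polar factor U V^T from the SVD,
        # <g, p> = sum of singular values = ||g||_{S_1}.
        U, _, Vt = np.linalg.svd(g, full_matrices=False)
        return U @ Vt
    raise ValueError(f"unknown geometry: {geometry}")
```

This is exactly the structure signSGD/Signum and specSGD/Muon exploit: the update is the (momentum-filtered) steepest-descent direction in their respective norm balls.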