Complexity of normalized stochastic first-order methods with momentum under heavy-tailed noise

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

In this paper, we propose practical normalized stochastic first-order methods with Polyak momentum, multi-extrapolated momentum, and recursive momentum for solving unconstrained optimization problems. These methods employ dynamically updated algorithmic parameters and do not require explicit knowledge of problem-dependent quantities such as the Lipschitz constant or noise bound. We establish first-order oracle complexity results for finding approximate stochastic stationary points under heavy-tailed noise and weakly average smoothness conditions – both of which are weaker than the commonly used bounded variance and mean-squared smoothness assumptions. Our complexity bounds either improve upon or match the best-known results in the literature. Numerical experiments are presented to demonstrate the practical effectiveness of the proposed methods.


💡 Research Summary

This paper addresses the challenge of stochastic first‑order optimization under heavy‑tailed noise, where the stochastic gradient estimator possesses a bounded α‑th central moment (α∈(1,2]) rather than a bounded variance. Classical analyses assume either bounded variance or almost‑sure Lipschitz continuity of the stochastic gradients, assumptions that are often violated in modern large‑scale machine‑learning applications. Moreover, gradient‑clipping techniques, while theoretically sound, require large clipping thresholds and prior knowledge of problem‑dependent constants such as the Lipschitz constant or noise bound, limiting their practical utility.
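To make the noise assumption concrete, here is a small numerical illustration (my own, not from the paper) using NumPy's Lomax-type Pareto sampler: noise with tail index 1.5 has a finite α-th moment for every α < 1.5 but infinite variance, so the bounded-variance assumption fails while the bounded α-th moment assumption still holds.

```python
import numpy as np

# Illustration (not from the paper): Pareto/Lomax noise with shape 1.5 has
# finite moments only of order < 1.5, so E[|X|^1.3] is finite while
# E[|X|^2] is infinite; the empirical second moment is dominated by a few
# extreme samples and keeps growing with the sample size.
rng = np.random.default_rng(0)
noise = rng.pareto(1.5, size=1_000_000)

alpha = 1.3
moment_alpha = np.mean(noise ** alpha)  # empirical alpha-th moment: stays moderate
moment_2 = np.mean(noise ** 2)          # empirical second moment: blows up
print(f"alpha-th moment ~ {moment_alpha:.2f}, second moment ~ {moment_2:.2f}")
```

This is exactly the regime the paper targets: a gradient estimator whose α-th central moment is bounded for some α ∈ (1, 2] even though its variance is not.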

To overcome these limitations, the authors propose three practical normalized stochastic first‑order methods (SFOMs) that replace clipping with gradient normalization (i.e., scaling the stochastic gradient to unit norm) and incorporate momentum in three distinct ways: (1) Polyak momentum, (2) multi‑extrapolated momentum, and (3) recursive momentum. All three algorithms dynamically update step sizes η_k and momentum weights θ_k without needing explicit problem constants. When the tail exponent α is known, the step‑size and momentum schedules are tuned to exploit this knowledge; when α is unknown, a universal schedule is used, making the methods fully parameter‑free.

Algorithm 1 (Polyak momentum) generates a momentum vector m_k as a weighted average of the current normalized stochastic gradient and the previous momentum, then moves along the direction −m_k/‖m_k‖ with step size η_k. With η_k = (k+1)^{-(2α−1)/(3α−2)} and θ_k = (k+1)^{-α/(3α−2)} (α known), the method achieves an oracle complexity of
  O(ε^{-(3α−2)/(α−1)}).
If α is unknown, using η_k = (k+1)^{-3/4} and θ_k = (k+1)^{-1/2} yields a complexity of O(ε^{-2α/(α−1)}). Both bounds match or improve upon the best known results for normalized SGD under heavy‑tailed noise.
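The update just described can be sketched in a few lines. This is a minimal reading of the summary, not the paper's exact pseudocode: details such as where the normalization is applied, and the toy objective and noise, are my assumptions.

```python
import numpy as np

def normalized_sgd_polyak(grad_oracle, x0, alpha=1.5, iters=1000, seed=0):
    """Sketch of Algorithm 1 as described above. The placement of the
    normalization and the initialization are read off the summary and may
    differ from the paper's exact update."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for k in range(iters):
        # schedules for a known tail exponent alpha (from the summary)
        eta = (k + 1) ** (-(2 * alpha - 1) / (3 * alpha - 2))
        theta = (k + 1) ** (-alpha / (3 * alpha - 2))
        g = grad_oracle(x, rng)
        g_hat = g / max(np.linalg.norm(g), 1e-12)        # normalized stochastic gradient
        m = (1 - theta) * m + theta * g_hat              # Polyak momentum
        x = x - eta * m / max(np.linalg.norm(m), 1e-12)  # normalized step
    return x

# toy usage: quadratic objective with heavy-tailed (Pareto) gradient noise
def noisy_grad(x, rng):
    noise = 0.5 * (rng.pareto(1.5, size=x.shape) - 2.0)  # mean-zero, infinite variance
    return 2.0 * x + noise

x_star = normalized_sgd_polyak(noisy_grad, np.full(5, 2.0))
```

Note that no Lipschitz constant or noise bound appears anywhere in the loop; only the tail exponent α enters, through the two decay schedules, and the α-unknown variant simply replaces the exponents with −3/4 and −1/2.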

Algorithm 2 (multi‑extrapolated momentum) leverages higher‑order smoothness of the objective: the p‑th derivative of f is assumed L_p‑Lipschitz for some integer p≥2. By extrapolating past gradients with carefully chosen coefficients that decay with k, the algorithm attains a complexity of
  O(ε^{-(p(2α−1)+α−1)/(p(α−1))}).
When p=2, this already improves the exponent compared with the O(ε^{-(3α−2)/(α−1)}) bound, and the improvement grows with larger p. This is the first result showing that higher‑order smoothness can be exploited for acceleration even under heavy‑tailed noise.
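The claimed improvement at p=2 can be checked directly (my own arithmetic, not from the paper): the exponent decomposes as a fixed term plus 1/p,

```latex
\frac{p(2\alpha-1)+\alpha-1}{p(\alpha-1)}
  \;=\; \frac{2\alpha-1}{\alpha-1} + \frac{1}{p},
```

so p = 1 recovers (3α−2)/(α−1), every increase in p strictly shrinks the exponent, and the limit as p → ∞ is (2α−1)/(α−1). For instance, at α = 2 the exponent is 3 + 1/p: 4 for p = 1 and 3.5 for p = 2.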

Algorithm 3 (recursive momentum) assumes only an average‑Lipschitz condition on the stochastic gradients, E[‖∇F(x; ξ) − ∇F(y; ξ)‖] ≤ L‖x − y‖ for all x, y, which is weaker than the commonly used mean‑squared smoothness assumption E[‖∇F(x; ξ) − ∇F(y; ξ)‖²] ≤ L²‖x − y‖².
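The precise form of Algorithm 3's update is not spelled out here. A common instantiation of recursive momentum is the STORM-style gradient-difference estimator; treating that as an assumption (the paper's recursion and schedules may differ), combining it with a normalized step looks roughly like this:

```python
import numpy as np

def normalized_recursive_momentum(x0, iters=800, seed=0):
    """Hypothetical sketch only: a STORM-style recursive-momentum estimator
    with a normalized step, illustrated on f(x) = ||x||^2 with additive
    heavy-tailed noise. The paper's Algorithm 3 may differ in detail."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)

    def grad(z, xi):              # stochastic gradient at z under noise sample xi
        return 2.0 * z + xi

    def sample(shape):            # mean-zero heavy-tailed noise (infinite variance)
        return 0.5 * (rng.pareto(1.5, size=shape) - 2.0)

    m = grad(x, sample(x.shape))
    for k in range(1, iters + 1):
        eta = (k + 1) ** (-0.75)  # illustrative "universal" schedules (alpha unknown)
        theta = (k + 1) ** (-0.5)
        x_prev, x = x, x - eta * m / max(np.linalg.norm(m), 1e-12)
        xi = sample(x.shape)
        # recursive momentum: gradient-difference correction at the SAME sample
        m = grad(x, xi) + (1 - theta) * (m - grad(x_prev, xi))
    return x

x_star = normalized_recursive_momentum(np.full(4, 4.0))
```

The key structural point is that the correction m − grad(x_prev, ξ) is evaluated at the same sample ξ as the fresh gradient, so fresh noise enters the estimator only with the small weight θ_k; this is exactly why an average-Lipschitz condition on the stochastic gradients, rather than a per-sample one, suffices to control the recursion.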

