Acceleration for Polyak-Łojasiewicz Functions with a Gradient Aiming Condition
It is known that when minimizing smooth Polyak-Łojasiewicz (PL) functions, momentum algorithms cannot significantly improve the convergence bound of gradient descent, contrasting with the acceleration phenomenon occurring in the strongly convex case. To bridge this gap, the literature has proposed strongly quasar-convex functions as an intermediate non-convex class, for which accelerated bounds have been suggested to persist. We show that this is not true in general: the additional structure of strong quasar-convexity does not suffice to guarantee better worst-case bounds for momentum compared to gradient descent. As an alternative, we study PL functions under an aiming condition that measures how well the descent direction points toward a minimizer. This perspective clarifies the geometric ingredient enabling provable acceleration by momentum when minimizing PL functions.
💡 Research Summary
This paper revisits the theoretical benefits of momentum, specifically Nesterov acceleration, for non‑convex optimization. While it is well‑known that momentum yields provable acceleration for smooth, strongly‑convex functions, recent work has suggested that a broader class—strongly quasar‑convex (SQC) functions—might also enjoy accelerated rates. The authors demonstrate that this belief is unfounded: SQC does not guarantee better worst‑case convergence than plain gradient descent (GD).
The key observation is that any SQC function automatically satisfies a Polyak‑Łojasiewicz (PL) inequality with some constant µ′ that can be substantially larger (or smaller) than the µ appearing in the SQC definition. Consequently, the PL‑based GD rate µ′/L can dominate the SQC‑based Nesterov rate τ·p·µ/L, and vice‑versa, depending on the specific function. Numerical examples illustrate both possibilities, showing that SQC alone cannot be used to claim acceleration. Moreover, the authors point out a conceptual pitfall: the parameter µ in SQC does not play the same curvature‑lower‑bound role as in strong convexity, so interpreting L/µ as a condition number is misleading.
To identify a structural property that truly enables acceleration, the paper introduces an “aiming condition”:
⟨∇f(x), x−x*⟩ ≥ a ‖∇f(x)‖ ‖x−x*‖ with 0 < a ≤ 1.
This inequality quantifies how well the gradient points toward a global minimizer x*. When a is sufficiently large, the authors prove that a continuized, stochastic parameterization of Nesterov’s method achieves a linear convergence rate of the form (1 − c·a·√(µ/L))^t, i.e., an accelerated square‑root dependence on the condition number (scaled by the alignment constant a), compared with the standard PL‑GD rate (1 − µ/L)^t. The proof relies on a Lyapunov function that combines the objective gap with the distance of the auxiliary momentum variable to the optimum; the aiming condition guarantees a sufficient decrease of this Lyapunov function along the Nesterov ODE.
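As a sanity check of the definition (not an example from the paper): for a quadratic f(x) = ½xᵀAx with eigenvalues in [µ, L] and minimizer x* = 0, one has ⟨∇f(x), x − x*⟩ = xᵀAx ≥ µ‖x‖² while ‖∇f(x)‖·‖x − x*‖ ≤ L‖x‖², so the aiming condition holds with a ≥ µ/L. A minimal numerical sketch of this bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic f(x) = 0.5 * x^T A x with minimizer x* = 0 and spectrum in [mu, L].
mu, L = 1.0, 10.0
A = np.diag(np.linspace(mu, L, 5))

def aiming_ratio(x):
    """<grad f(x), x - x*> / (||grad f(x)|| ||x - x*||) with x* = 0."""
    g = A @ x  # gradient of f at x
    return g @ x / (np.linalg.norm(g) * np.linalg.norm(x))

# Sample random points: the ratio stays in [mu/L, 1] for this quadratic.
ratios = [aiming_ratio(rng.standard_normal(5)) for _ in range(1000)]
print(min(ratios), max(ratios))
```

The upper bound of 1 is just Cauchy–Schwarz; the lower bound µ/L shows that strongly convex quadratics automatically satisfy the aiming condition with a constant tied to their conditioning.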
Recognizing that the aiming condition may not hold at every iteration in practice, the authors further relax it to an average version: the inequality must hold in expectation over a sufficiently large fraction of the trajectory. Under this weaker assumption they recover the same accelerated rate, showing that occasional misalignment does not destroy the overall speed‑up. This result aligns with empirical observations in deep learning, where gradients are often well‑aligned on average even if they occasionally point away from the optimum.
To illustrate the limits of momentum without alignment, the paper constructs a two‑dimensional PL function with a unique minimizer that violates the aiming condition in a region of the domain. Experiments reveal that Nesterov’s method initially drives iterates away from the optimum, causing GD to outperform momentum during the early phase. Only after the iterates enter a region where the aiming condition holds does the accelerated behavior emerge. This example provides concrete evidence that the geometry of the gradient field, not merely the PL inequality, determines whether momentum can be beneficial.
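The kind of experiment described above can be sketched on a toy problem. The function below uses the standard nonconvex PL example x² + 3 sin²(x) (extended with a quadratic in y); it is NOT the paper’s two‑dimensional counterexample, and the step size and momentum coefficient are illustrative assumptions. Heavy‑ball momentum stands in for the paper’s continuized Nesterov scheme:

```python
import numpy as np

# Toy comparison (assumed setup, not the paper's construction): gradient
# descent vs. Polyak heavy-ball momentum on a nonconvex PL-type function
# with unique minimizer at the origin, monitoring the aiming ratio
# <grad f(x), x - x*> / (||grad f(x)|| ||x - x*||) at the starting point.

def f(p):
    x, y = p
    return x**2 + 3.0 * np.sin(x)**2 + y**2

def grad(p):
    x, y = p
    return np.array([2.0 * x + 3.0 * np.sin(2.0 * x), 2.0 * y])

def aiming_ratio(p):
    g, d = grad(p), p  # the minimizer x* is the origin
    return g @ d / (np.linalg.norm(g) * np.linalg.norm(d))

step, beta = 1.0 / 8.0, 0.8  # 1/L with L = 8 along x; beta is an assumption
p0 = np.array([2.5, 2.5])
p_gd, p_mom, v = p0.copy(), p0.copy(), np.zeros(2)
for _ in range(300):
    p_gd = p_gd - step * grad(p_gd)
    v = beta * v - step * grad(p_mom)  # heavy-ball velocity update
    p_mom = p_mom + v

print(f(p_gd), f(p_mom), aiming_ratio(p0))
```

Tracking `aiming_ratio` along the whole trajectory (not just at p0) is how one would empirically check whether the alignment assumption, or its averaged relaxation, holds for a given problem.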
In summary, the contributions are threefold: (1) a rigorous counter‑example showing that SQC does not guarantee accelerated convergence; (2) the introduction of a geometric aiming condition that, when satisfied (even on average), enables provable acceleration for PL functions; and (3) empirical validation that violating this condition can make momentum detrimental. The work clarifies the precise structural requirement beyond PL needed for momentum‑based acceleration, offering a more nuanced theoretical foundation for the widespread empirical success of momentum in non‑convex machine‑learning problems. Future directions include adaptive estimation of the alignment constant a, extensions to stochastic gradients, and handling multiple minima.