A Generalized Version of Chung's Lemma and its Applications
Chung’s Lemma is a classical tool for establishing asymptotic convergence rates of (stochastic) optimization methods under strong convexity-type assumptions and appropriate polynomial diminishing step sizes. In this work, we develop a generalized version of Chung’s Lemma, which provides a simple non-asymptotic convergence framework for a more general family of step size rules. We demonstrate broad applicability of the proposed generalized lemma by deriving tight non-asymptotic convergence rates for a large variety of stochastic methods. In particular, we obtain partially new non-asymptotic complexity results for stochastic optimization methods, such as Stochastic Gradient Descent (SGD) and Random Reshuffling (RR), under a general $(\theta,\mu)$-Polyak-Łojasiewicz (PL) condition and for various step size strategies, including polynomial, constant, exponential, and cosine step size rules. Notably, as a by-product of our analysis, we observe that exponential step sizes exhibit superior adaptivity to both landscape geometry and gradient noise; specifically, they achieve optimal convergence rates without requiring exact knowledge of the underlying landscape or separate parameter selection strategies for noisy and noise-free regimes. Our results demonstrate that the developed variant of Chung’s Lemma offers a versatile, systematic, and streamlined approach to establishing non-asymptotic convergence rates under general step size rules.
💡 Research Summary
The paper revisits Chung’s Lemma—a classic tool for analyzing convergence of stochastic approximation algorithms—and extends it to a much broader class of step‑size schedules. The original lemma deals with recursions of the form
$a_{k+1}\le (1-c/k^{p})\,a_k+d/k^{p+q}$
and yields asymptotic rates only for polynomially diminishing step sizes. The authors propose a generalized recursion
$a_{k+1}\le\Bigl(1-\frac{1}{s(b_k)}\Bigr)a_k+\frac{1}{t(b_k)}$
where $s(\cdot)$ and $t(\cdot)$ are positive, continuously differentiable functions of a scalar sequence $\{b_k\}$ that encodes the step-size rule (e.g., $\alpha_k$, decay parameters, etc.). The key structural assumption is that the ratio $r(b)=s(b)/t(b)$ is convex on a suitable interval. Under this mild condition, Theorem 2.1 (Generalized Chung’s Lemma) decomposes the convergence bound into two components:
- S‑induced rate – the deterministic contraction contributed by the product $\prod_{i=0}^{k}(1-1/s(b_i))$. This term typically decays rapidly and captures the “ideal” behavior when noise is absent.
- T‑induced rate – the contribution of the error term $1/t(b_k)$, which models stochastic gradient noise, variance, or bias.
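The interplay between these two terms can be checked numerically. Below is a small sketch of our own (not an experiment from the paper) that iterates the generalized recursion with the classical polynomial choice $s(k)=k^{p}$, $t(k)=k^{p+q}$ and confirms that $a_k$ tracks the predicted rate $r(k)=s(k)/t(k)=k^{-q}$:

```python
# Illustrative sketch (our own, constants assumed): iterate the generalized
# recursion a_{k+1} = (1 - 1/s(b_k)) a_k + 1/t(b_k) and compare the final
# iterate against the predicted rate r(b_k) = s(b_k)/t(b_k).

def run_recursion(s, t, a0=1.0, K=10_000):
    """Iterate a_{k+1} = (1 - 1/s(k)) a_k + 1/t(k) for k = 1..K."""
    a = a0
    for k in range(1, K + 1):
        a = (1.0 - 1.0 / s(k)) * a + 1.0 / t(k)
    return a

p, q = 0.5, 0.5
s = lambda k: k**p          # contraction factor 1 - k^{-p}
t = lambda k: k**(p + q)    # error term k^{-(p+q)}

K = 10_000
a_K = run_recursion(s, t, K=K)
predicted = s(K) / t(K)     # r(b_K) = K^{-q}
print(a_K, predicted)       # a_K stays within a constant factor of predicted
```

Running this, the ratio `a_K / predicted` remains bounded by a modest constant, in line with the $O(k^{-q})$ rate that the classical lemma gives for this polynomial schedule.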
Combining the two yields a clean non‑asymptotic bound
$a_k = O\bigl(s(b_k)/t(b_k)\bigr) = O\bigl(r(b_k)\bigr)$.
When $s(b_k)=k^{p}$ and $t(b_k)=k^{p+q}$, the bound reduces to the classic $O(k^{-q})$, but the result now holds for exponential, cosine, constant, or any other schedule that can be expressed via suitable $s$ and $t$.
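To see how a non-polynomial schedule fits the same template, here is a sketch (again our own, with assumed constants) for an exponentially decaying step size $\alpha_k=\alpha\gamma^{k}$, encoded via $s(b_k)=1/\alpha_k$ and $t(b_k)=1/\alpha_k^{2}$, so that $r(b_k)=\alpha_k$:

```python
# Sketch with assumed constants (not the paper's experiment): the same
# generalized recursion driven by an exponential step-size schedule
# alpha_k = alpha * gamma^k, i.e. contraction 1 - alpha_k and error alpha_k^2.

def run_exponential(alpha=0.5, gamma=0.999, a0=1.0, K=5000):
    a, step = a0, alpha
    for _ in range(K):
        a = (1.0 - step) * a + step * step   # contraction + squared-step error
        step *= gamma
    return a, step

a_K, alpha_K = run_exponential()
print(a_K, alpha_K)   # a_K remains within a modest factor of r(b_K) = alpha_K
```

As long as the step size has not decayed far below the schedule's own decay rate $1-\gamma$, the iterate tracks $r(b_k)=\alpha_k$ down, illustrating how one framework covers both regimes.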
The authors then apply this framework to stochastic gradient descent (SGD) and Random Reshuffling (RR) under a generalized $(\theta,\mu)$-Polyak‑Łojasiewicz (PL) condition:
$\|\nabla f(x)\|^{2}\ge 2\mu\,(f(x)-f^{\star})^{\theta}$, where setting $\theta=1$ recovers the classical PL inequality.
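A minimal toy run (our own construction, all constants assumed) makes the setting concrete: $f(x)=x^{2}/2$ satisfies the condition with $\theta=1$, $\mu=1$, since $|f'(x)|^{2}=x^{2}=2(f(x)-f^{\star})$, and we run SGD with Gaussian gradient noise and an exponentially decaying step size:

```python
import random

# Toy illustration (hypothetical setup, not the paper's experiment):
# SGD on f(x) = x^2 / 2, which is (theta, mu)-PL with theta = 1, mu = 1.
def sgd_quadratic(alpha=0.5, gamma=0.999, sigma=0.1, K=5000, seed=0):
    rng = random.Random(seed)
    x, step = 5.0, alpha
    for _ in range(K):
        g = x + rng.gauss(0.0, sigma)   # stochastic gradient of x^2 / 2
        x -= step * g
        step *= gamma                    # exponential step-size decay
    return 0.5 * x * x                   # optimality gap f(x) - f*, f* = 0

print(sgd_quadratic())            # noisy run: gap driven down to the noise floor
print(sgd_quadratic(sigma=0.0))   # noise-free run: gap essentially vanishes
```

The same schedule and hyperparameters work in both the noisy and noise-free runs, echoing the paper's observation that exponential step sizes need no separate tuning across the two regimes.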