Shuffling the Stochastic Mirror Descent via Dual Lipschitz Continuity and Kernel Conditioning
The global Lipschitz smoothness condition underlies most convergence and complexity analyses via two key consequences: the descent lemma and the Lipschitz continuity of the gradient. How to analyze the performance of optimization algorithms in the absence of Lipschitz smoothness remains an active research area. The relative smoothness framework of Bauschke-Bolte-Teboulle (2017) and Lu-Freund-Nesterov (2018) provides an extended descent lemma, ensuring convergence of Bregman-based proximal gradient methods and their vanilla stochastic counterparts. However, many widely used techniques (e.g., momentum schemes, random reshuffling, and variance reduction) additionally require a Lipschitz-type bound on gradient deviations, leaving their analysis under relative smoothness an open problem. To resolve this issue, we introduce the dual kernel conditioning (DKC) regularity condition, which regulates the local relative curvature of the kernel function. Combined with relative smoothness, DKC provides a dual Lipschitz continuity for gradients: even though the gradient mapping is not Lipschitz in the primal space, it preserves Lipschitz continuity in the dual space induced by the mirror map. We verify that DKC is widely satisfied by popular kernels and is closed under affine composition and conic combination. With these novel tools, we establish the first complexity bounds as well as iterate convergence of random reshuffling mirror descent for constrained nonconvex relatively smooth problems.
💡 Research Summary
The paper addresses a fundamental gap in the analysis of stochastic optimization algorithms for non‑convex problems that lack global Lipschitz smoothness. While the relative smoothness framework supplies an extended descent lemma based on a Bregman divergence generated by a kernel h, it does not provide a Lipschitz‑type bound on gradient differences, which is essential for advanced techniques such as momentum, random reshuffling, and variance reduction.
To fill this gap the authors introduce Dual Kernel Conditioning (DKC), a regularity condition on the kernel h. Assuming h is block-separable, DKC requires that for any fixed dual-space diameter δ (measured by ρ_h(x,y) = ‖∇h(x)−∇h(y)‖) there exists a uniform bound κ_δ on the block-wise condition numbers κ_j = L_j/μ_j, where L_j and μ_j are the maximal and minimal eigenvalues of the block Hessians over any region of dual diameter at most δ. The condition is shown to hold for a broad family of kernels, including the Shannon entropy, (regularized) Burg entropy, Fermi-Dirac entropy, exponential kernels, and power-type kernels. Moreover, DKC is closed under scaling, compatible affine transformations, and conic combinations, so complex kernel constructions inherit the property.
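As a concrete illustration (not taken from the paper), the dual distance ρ_h has a simple closed form for two of the kernels listed above, since it only requires the mirror map ∇h. The function names below are our own:

```python
import numpy as np

def dual_dist_shannon(x, y):
    """Dual distance for the Shannon entropy kernel h(x) = sum_j x_j log x_j.

    Here grad h(x) = 1 + log(x), so rho_h(x, y) = ||log(x) - log(y)||.
    """
    return np.linalg.norm(np.log(x) - np.log(y))

def dual_dist_burg(x, y):
    """Dual distance for the Burg entropy kernel h(x) = -sum_j log x_j.

    Here grad h(x) = -1/x, so rho_h(x, y) = ||1/y - 1/x||.
    """
    return np.linalg.norm(1.0 / y - 1.0 / x)

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.4, 0.4, 0.2])
print(dual_dist_shannon(x, y), dual_dist_burg(x, y))
```

Note how the same pair of points can be close in one dual geometry and far in another; DKC is stated relative to a fixed kernel's dual diameter.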
Combining DKC with relative smoothness (L‑smoothness relative to h) yields a dual Lipschitz continuity of the objective gradient:
‖∇f(x)−∇f(y)‖ ≤ L·κ_δ·ρ_h(x,y) for all x,y in a region whose dual diameter does not exceed δ.
Unlike the classical Euclidean Lipschitz bound, this inequality is expressed in the dual space induced by the mirror map ∇h. Consequently, even when ∇f is not Lipschitz in the primal space, it behaves Lipschitz‑like with respect to the non‑Euclidean distance ρ_h. This property enables the control of stochastic error terms in algorithms that rely on gradient deviations.
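A quick numerical sanity check makes this concrete. It is a sketch of the phenomenon, not the paper's analysis, using the textbook example f(x) = x⁴/4, which is 1-smooth relative to the kernel h(x) = x⁴/4 + x²/2: the gradient ∇f(x) = x³ is not Lipschitz on ℝ, yet its deviations are controlled by the dual distance ρ_h:

```python
import numpy as np

rng = np.random.default_rng(0)

grad_f = lambda x: x**3          # f(x) = x^4/4; grad_f is NOT Lipschitz on R
grad_h = lambda x: x**3 + x      # kernel h(x) = x^4/4 + x^2/2

# Check the dual Lipschitz bound |grad_f(x) - grad_f(y)| <= rho_h(x, y),
# where rho_h(x, y) = |grad_h(x) - grad_h(y)|, on random pairs.
for _ in range(1000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    assert abs(grad_f(x) - grad_f(y)) <= abs(grad_h(x) - grad_h(y)) + 1e-12
print("dual Lipschitz bound held on all sampled pairs")
```

In this one-dimensional example the bound even holds globally (since x³ − y³ and x − y share the same sign, |x³ − y³| ≤ |(x³ − y³) + (x − y)|); in general the paper's bound is local, restricted to regions of dual diameter at most δ.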
Armed with these tools, the authors design Random Reshuffling Mirror Descent (RRMD). In each epoch the data indices are permuted without replacement, and the mirror update is performed as
x_{k+1}=∇h⁎(∇h(x_k)−α_k∇f_{i_k}(x_k)),
where α_k is a stepsize chosen so that consecutive iterates stay within dual diameter δ. Using the dual Lipschitz bound, the variance introduced by the random permutation can be bounded by O(α_k·κ_δ). This leads to an improved sample complexity: to obtain an ε-stationary point (‖∇f(x̄)‖ ≤ ε) the algorithm requires O(ε^{−1.5}) stochastic gradient evaluations, matching the best known rates for random reshuffling in convex settings and improving over the O(ε^{−2}) rate typical of vanilla stochastic mirror descent on non-convex problems.
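To make the update rule concrete, here is a minimal sketch (not the authors' code) of random reshuffling mirror descent specialized to the Shannon entropy kernel on the probability simplex, where the update x_{k+1} = ∇h*(∇h(x_k) − α∇f_{i_k}(x_k)) reduces to a multiplicative, exponentiated-gradient step. The toy objective and function name are illustrative assumptions:

```python
import numpy as np

def rrmd_simplex(grads, x0, alpha, epochs, seed=0):
    """Random reshuffling mirror descent with the Shannon entropy kernel.

    On the simplex the mirror update becomes the exponentiated-gradient
    step x <- x * exp(-alpha * g) followed by renormalization, which is
    the Bregman projection back onto the simplex.
    `grads` is a list of per-component gradient oracles g_i(x).
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(grads)):  # sample without replacement
            x = x * np.exp(-alpha * grads[i](x))
            x /= x.sum()                       # Bregman projection
    return x

# Toy problem: f(x) = (1/n) * sum_i 0.5 * ||x - c_i||^2 on the simplex.
cs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
grads = [lambda x, c=c: x - c for c in cs]
x = rrmd_simplex(grads, np.ones(3) / 3, alpha=0.5, epochs=200)
print(x)  # concentrates near (0.5, 0.5, 0), the constrained minimizer
```

Note that the entropy kernel keeps every iterate strictly inside the simplex, illustrating why the paper's analysis must handle feasible sets Z that are proper subsets of dom h.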
A further contribution is the last-iterate convergence result. When the objective f is definable in an o-minimal structure and the kernel satisfies DKC, the full sequence {x_k} generated by RRMD converges to a critical point of f, even when the feasible set Z is a proper subset of dom h. This removes the restrictive assumption Z = dom h = ℝ^d that appears in prior relative-smoothness literature, thereby extending the applicability to entropy-based constraints and other non-Euclidean domains.
In summary, the paper makes three major advances: (1) it proposes the Dual Kernel Conditioning regularity, a versatile condition satisfied by many practical kernels; (2) it shows that DKC together with relative smoothness yields a dual‑space Lipschitz continuity of gradients, enabling the analysis of stochastic methods that need gradient deviation bounds; (3) it leverages these insights to prove O(ε^{‑1.5}) complexity and full‑sequence convergence for Random‑Shuffling Mirror Descent on constrained non‑convex problems. The work thus bridges the gap between relative‑smoothness theory and modern stochastic acceleration techniques, opening the door for efficient, provably convergent algorithms in a wide range of non‑Euclidean, non‑convex optimization tasks.