Fast Frank–Wolfe Algorithms with Adaptive Bregman Step-Size for Weakly Convex Functions
We propose Frank–Wolfe (FW) algorithms with an adaptive Bregman step-size strategy for smooth adaptable (also known as relatively smooth) weakly convex functions. This means that the gradient of the objective function is not necessarily Lipschitz continuous; we only require the smooth adaptable property. Compared with existing FW algorithms, our assumptions are less restrictive. We establish convergence guarantees in various settings, with rates ranging from sublinear to linear depending on the assumptions, for both convex and nonconvex objective functions. Assuming that the objective function is weakly convex and satisfies the local quadratic growth condition, we provide both local sublinear and local linear convergence with respect to the primal gap. We also propose a variant of the away-step FW algorithm using Bregman distances over polytopes. We establish faster global convergence (up to a linear rate) for convex optimization under the Hölder error bound condition and local linear convergence for nonconvex optimization under the local quadratic growth condition. Numerical experiments demonstrate that our proposed FW algorithms outperform existing methods.
💡 Research Summary
The paper tackles the constrained optimization problem min_{x∈P} f(x) where the feasible set P is a compact convex set (often a polytope) and the objective f may be smooth, weakly convex, or even non‑convex. Classical Frank‑Wolfe (FW) methods rely on two restrictive assumptions: (i) the gradient of f is Lipschitz continuous (L‑smoothness) and (ii) f is convex (or strongly convex for linear rates). This work relaxes both assumptions by introducing a Bregman‑distance based framework and an adaptive step‑size rule.
Key technical ingredients
- Kernel generating distance ϕ and Bregman distance D_ϕ – ϕ is a strictly convex “kernel” defined on an open convex set C, and the associated Bregman distance D_ϕ(x,y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), x−y⟩ measures proximity. A crucial inequality (2.1) holds: D_ϕ((1−γ)x+γy, x) ≤ γ^{1+ν} D_ϕ(y, x) for some ν > 0, which captures the curvature of the distance.
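For the Euclidean kernel ϕ = ½‖·‖², the Bregman distance reduces to D_ϕ(x,y) = ½‖x−y‖² and inequality (2.1) holds with ν = 1 (in fact with equality). The following is a minimal numerical check of this special case; it illustrates the inequality, not the general kernels used in the paper.

```python
import numpy as np

def bregman_euclidean(x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> with phi = 0.5*||.||^2,
    # which simplifies to 0.5*||x - y||^2
    return 0.5 * np.sum((x - y) ** 2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
for gamma in (0.1, 0.5, 0.9):
    # inequality (2.1): D_phi((1-gamma)x + gamma*y, x) <= gamma^(1+nu) * D_phi(y, x)
    lhs = bregman_euclidean((1 - gamma) * x + gamma * y, x)
    rhs = gamma ** 2 * bregman_euclidean(y, x)  # nu = 1 for the Euclidean kernel
    assert lhs <= rhs + 1e-12
```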
- L‑smooth adaptability (L‑smad) – The pair (f, ϕ) is L‑smad if both Lϕ − f and Lϕ + f are convex on C. This generalizes L‑smoothness (recovered when ϕ = ½‖·‖²) and includes many functions whose gradients are not Lipschitz, such as −log x, ¼x⁴, ℓ_p losses with p ≠ 2, and objective functions arising in non‑negative matrix factorization, phase retrieval, and blind deconvolution.
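As a concrete one-dimensional illustration (my own example, using the standard kernel choice for quartic objectives rather than anything specific from the paper): f(x) = ¼x⁴ has no globally Lipschitz gradient, yet the pair (f, ϕ) with ϕ(x) = ¼x⁴ + ½x² is 1-smad, since both ϕ − f and ϕ + f have nonnegative second derivatives everywhere. A quick numerical sanity check:

```python
import numpy as np

# f(x) = x^4/4 has gradient x^3, which is not globally Lipschitz.
# With kernel phi(x) = x^4/4 + x^2/2, both L*phi - f and L*phi + f
# are convex for L = 1, so (f, phi) is 1-smad.
f_pp = lambda x: 3.0 * x ** 2          # f''(x)
phi_pp = lambda x: 3.0 * x ** 2 + 1.0  # phi''(x)
L = 1.0
xs = np.linspace(-100.0, 100.0, 10001)
assert np.all(L * phi_pp(xs) - f_pp(xs) >= 0)  # (L*phi - f)'' >= 0
assert np.all(L * phi_pp(xs) + f_pp(xs) >= 0)  # (L*phi + f)'' >= 0
```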
- Adaptive Bregman step‑size – Instead of fixing a step‑size based on a known Lipschitz constant, the algorithm estimates the effective constant L and the scaling exponent ν on the fly. The step‑size γ_t is chosen as the largest value in
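One common way to realize such an adaptive rule is backtracking on the constant estimate: minimize the Bregman surrogate γ⟨∇f(x), s−x⟩ + γ^{1+ν} L D_ϕ(s, x) in γ, and double L whenever the surrogate fails to upper-bound f at the trial point. The sketch below follows that generic pattern; the names (`fw_step`, `lmo`, `D_phi`) and the doubling rule are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def fw_step(f, grad_f, D_phi, lmo, x, L, nu=1.0, max_tries=50):
    # One FW iteration with a backtracked estimate L of the smad constant.
    g = grad_f(x)
    s = lmo(g)          # linear minimization oracle over the feasible set P
    d = s - x           # FW direction; <g, d> <= 0 by optimality of s
    for _ in range(max_tries):
        # closed-form minimizer of gamma*<g,d> + gamma^(1+nu)*L*D_phi(s,x),
        # clipped to the feasible step range [0, 1]
        gamma = min(1.0, (-np.dot(g, d) / ((1 + nu) * L * D_phi(s, x))) ** (1 / nu))
        surrogate = f(x) + gamma * np.dot(g, d) + gamma ** (1 + nu) * L * D_phi(s, x)
        if f(x + gamma * d) <= surrogate:
            return x + gamma * d, L     # surrogate valid: accept the step
        L *= 2.0                        # surrogate violated: increase estimate
    return x + gamma * d, L

# Tiny usage check (hypothetical toy problem): minimize 0.5*||x - c||^2
# over the box [-1, 1]^2 with the Euclidean kernel.
c = np.array([0.3, -0.2])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c
D_phi = lambda s, x: 0.5 * np.sum((s - x) ** 2)
lmo = lambda g: -np.sign(g)             # vertex of the box minimizing <g, .>
x, L = fw_step(f, grad_f, D_phi, lmo, np.array([1.0, 1.0]), L=1.0)
```

With the Euclidean kernel and ν = 1 this recovers the familiar exact line-search step γ = −⟨g, d⟩ / (L‖d‖²), which is why the first trial already satisfies the surrogate test on this quadratic toy problem.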