Stochastic Bilevel Optimization with Heavy-Tailed Noise
This paper considers smooth bilevel optimization in which the lower-level problem is strongly convex and the upper-level problem is possibly nonconvex. We focus on the stochastic setting where the algorithm can access unbiased stochastic gradient evaluations with heavy-tailed noise, which is prevalent in many machine learning applications, such as training large language models and reinforcement learning. We propose a nested-loop normalized stochastic bilevel approximation (N$^2$SBA) method for finding an $ε$-stationary point with a stochastic first-order oracle (SFO) complexity of $\tilde{\mathcal{O}}\big(κ^{\frac{7p-3}{p-1}} σ^{\frac{p}{p-1}} ε^{-\frac{4 p - 2}{p-1}}\big)$, where $κ$ is the condition number, $p\in(1,2]$ is the order of the central moment of the noise, and $σ$ is the noise level. Furthermore, we specialize our idea to the nonconvex-strongly-concave minimax optimization problem, achieving an $ε$-stationary point with an SFO complexity of~$\tilde{\mathcal O}\big(κ^{\frac{2p-1}{p-1}} σ^{\frac{p}{p-1}} ε^{-\frac{3p-2}{p-1}}\big)$. All of the above upper bounds match the best-known results in the special case of the bounded-variance setting, i.e., $p=2$. We also conduct numerical experiments to show the empirical superiority of the proposed methods.
💡 Research Summary
The paper addresses stochastic bilevel optimization where the lower‑level problem is strongly convex and the upper‑level problem may be non‑convex, under the realistic assumption that stochastic gradients possess heavy‑tailed noise. Instead of the common bounded‑variance (p = 2) assumption, the authors consider a p‑th bounded central moment (p‑BCM) condition with p∈(1, 2], allowing the variance to be infinite while only the p‑th moment is finite. This setting captures the heavy‑tailed gradient distributions observed in large‑scale machine learning tasks such as large language model training and reinforcement learning.
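The p-BCM condition above can be made concrete with a small numerical sketch. The snippet below (an illustration, not the paper's code) draws symmetrized Pareto noise with tail index 1.5, so the p-th absolute moment is finite for any p below 1.5 while the variance is infinite; the distribution, tail index, and sample size are all illustrative choices.

```python
import numpy as np

# Symmetrized Pareto noise with tail index alpha = 1.5: moments of order
# >= alpha diverge, so E[X^2] is infinite, yet E|X|^p is finite for p < 1.5.
# This is exactly the regime the p-BCM condition covers with p in (1, 2].
alpha = 1.5   # tail index (illustrative choice)
p = 1.2       # p-BCM order we probe; p < alpha, so this moment is finite

rng = np.random.default_rng(0)
n = 1_000_000
signs = rng.choice([-1.0, 1.0], size=n)
noise = signs * (rng.pareto(alpha, size=n) + 1.0)

pth_moment = np.mean(np.abs(noise) ** p)  # stabilizes near its finite mean
second_moment = np.mean(noise ** 2)       # dominated by extreme samples; diverges as n grows
print(pth_moment, second_moment)
```

In such experiments the empirical p-th moment settles down as the sample grows, while the empirical second moment keeps being inflated by ever-larger extreme samples, which is the practical signature of heavy-tailed gradients.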
The core contribution is a nested‑loop algorithm called N²SBA (Nested‑Loop Normalized Stochastic Bilevel Approximation). The algorithm first reformulates the bilevel problem via a penalized surrogate $\mathcal{L}^*_λ(x) = \min_y\, f(x,y) + λ\big(g(x,y) - g^*(x)\big)$, where $g$ is the lower‑level objective and $g^*(x)$ its optimal value, and then applies normalized stochastic gradient updates to keep the heavy‑tailed noise under control.
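The penalty reformulation can be illustrated on a toy quadratic bilevel problem. The sketch below is a minimal, assumption-laden example (the problem instance, penalty value, and inner-loop settings are all invented for illustration and are not the paper's N²SBA): it computes the gradient of the penalized surrogate via an inner gradient-descent loop and compares it to the true hypergradient, which it approaches as the penalty parameter grows.

```python
# Toy bilevel instance (illustrative assumptions, not the paper's setup):
#   upper level: f(x, y) = x^2/2 + (y - 1)^2/2
#   lower level: g(x, y) = (y - x)^2/2, so y*(x) = x, g*(x) = 0, and
#   F(x) = f(x, y*(x)) has true hypergradient 2x - 1.
# The penalized surrogate L*_lam(x) = min_y f(x,y) + lam*(g(x,y) - g*(x))
# has, by the envelope theorem, d/dx L*_lam(x) = x - lam*(y_lam - x).

def penalty_hypergrad(x, lam, inner_steps=200):
    # Inner loop: minimize f(x, y) + lam * g(x, y) over y by gradient descent.
    y = 0.0
    step = 1.0 / (1.0 + lam)  # curvature of the inner objective in y is 1 + lam
    for _ in range(inner_steps):
        y -= step * ((y - 1.0) + lam * (y - x))
    return x - lam * (y - x)  # partial derivative of the penalized objective in x

x = 0.3
true_grad = 2 * x - 1                      # = -0.4
approx = penalty_hypergrad(x, lam=100.0)
print(true_grad, approx)                   # gap shrinks like O(1/lam)
```

A useful sanity check is to increase the penalty: the bias of the surrogate gradient shrinks roughly like $1/λ$, which is why penalty-based bilevel methods take $λ$ large (at the cost of a worse-conditioned inner problem).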