Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime
Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.
💡 Research Summary
The paper studies stochastic gradient Langevin dynamics (SGLD) in the so‑called lazy‑training regime, where an output‑scaling factor α is chosen large enough that the parameters stay in a small neighbourhood of their random initialization throughout training. In this regime the neural tangent kernel (NTK) remains well‑conditioned, which is a key ingredient for the recent deterministic lazy‑training analyses of over‑parameterized networks.
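The role of the output-scaling factor α can be illustrated with a small numerical sketch. The toy two-layer network, widths, learning rate, and the centering of the output at initialization below are illustrative choices, not details taken from the paper; the sketch only demonstrates the qualitative lazy-training effect that larger α leads to smaller parameter displacement from initialization.

```python
import numpy as np

def train_displacement(alpha, steps=200, lr=0.05, seed=0):
    """Train a toy two-layer tanh network with output scaling alpha and
    return the parameter displacement ||theta_T - theta_0||.
    Illustrative sketch only: architecture, widths, and lr are arbitrary."""
    rng = np.random.default_rng(seed)
    n, d, m = 20, 5, 100                       # samples, input dim, width
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    W = rng.normal(size=(d, m)) / np.sqrt(d)   # hidden-layer weights
    a = rng.normal(size=m) / np.sqrt(m)        # output weights
    W0, a0 = W.copy(), a.copy()
    h0 = np.tanh(X @ W0) @ a0                  # center output at init
    for _ in range(steps):
        H = np.tanh(X @ W)
        f = alpha * (H @ a - h0)               # scaled, centered predictor
        r = f - y                              # residual of squared loss
        # gradients of (1/2n)||f - y||^2; step size rescaled by 1/alpha^2
        ga = alpha * H.T @ r / n
        gW = alpha * X.T @ ((r[:, None] * (1 - H**2)) * a[None, :]) / n
        W -= (lr / alpha**2) * gW
        a -= (lr / alpha**2) * ga
    return np.sqrt(np.linalg.norm(W - W0)**2 + np.linalg.norm(a - a0)**2)
```

Because the effective parameter step shrinks like 1/α while the function-space dynamics stay comparable, `train_displacement(50.0)` is much smaller than `train_displacement(1.0)`: the parameters stay near their initialization, which is what keeps the NTK well-conditioned.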
Problem setting and SGLD model
The authors consider a smooth predictor h(ω):ℝ^p→ℱ (ℱ a Hilbert space) and a smooth loss ℓ(x,h). The empirical risk R(h)=E_x[ℓ(x,h)], where the expectation is taken over the empirical distribution of the training data.
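For concreteness, the continuous-time SGLD dynamics can be simulated with an Euler–Maruyama discretization. The sketch below uses a constant diffusion coefficient σ as a placeholder assumption; the paper's analysis concerns a multiplicative, state-dependent noise term, which is not reproduced here. The quadratic risk is likewise a toy choice.

```python
import numpy as np

def sgld_step(theta, grad_R, lr, sigma, rng):
    """One Euler-Maruyama step of the Langevin SDE
        d(theta) = -grad R(theta) dt + sigma dB_t.
    NOTE: sigma is a constant placeholder; the paper studies a
    multiplicative, state-dependent diffusion coefficient."""
    noise = rng.normal(size=theta.shape)
    return theta - lr * grad_R(theta) + sigma * np.sqrt(lr) * noise

# Toy example: R(theta) = 0.5 * ||theta||^2, so grad R(theta) = theta.
rng = np.random.default_rng(0)
theta = np.ones(3)
for _ in range(500):
    theta = sgld_step(theta, lambda t: t, lr=0.05, sigma=0.01, rng=rng)
```

On this strongly convex toy risk the iterates contract toward the minimizer at the origin and fluctuate around it at a scale set by σ, mirroring the exponential convergence in expectation described in the abstract.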