All ERMs Can Fail in Stochastic Convex Optimization: Lower Bounds in Linear Dimension

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We study the sample complexity of the best-case Empirical Risk Minimizer in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, yet the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question of Feldman. We also extend this result to approximate ERMs. Building on our construction, we further show that (constrained) Gradient Descent can overfit when the horizon and learning rate grow with the sample size. Specifically, we provide a novel generalization lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$ for Gradient Descent, where $\eta$ is the learning rate, $T$ is the horizon, and $m$ is the sample size. This exponentially narrows the gap between the best known upper bound of $O(\eta T/m)$ and the lower bounds obtained from previous constructions.


💡 Research Summary

The paper investigates the sample‑complexity and generalization behavior of Empirical Risk Minimizers (ERMs) in Stochastic Convex Optimization (SCO) when the number of samples is linear in the ambient dimension. While prior work showed that ERMs can overfit in high‑dimensional regimes, the open question left by Feldman was whether a construction exists where the dimension is only Θ(m) (m = number of samples). The authors answer this affirmatively by constructing a learning instance in dimension d = 6·m with a 1‑Lipschitz, λ‑strongly convex loss (λ = Θ(m^{-3/2})). With probability at least ½ over the training sample, the empirical risk has a unique minimizer that achieves zero training error but incurs a constant excess risk on the population distribution. Consequently, the exact ERM fails to generalize despite the problem being learnable with O(1/√m) error by other algorithms.
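In the usual SCO notation (a sketch of the setup; the symbols are the standard ones and may differ from the paper's exact notation), the population and empirical risks are

```latex
F(w) = \mathbb{E}_{z \sim \mathcal{D}}\big[f(w, z)\big],
\qquad
F_S(w) = \frac{1}{m} \sum_{i=1}^{m} f(w, z_i),
```

and the failure statement above reads: with probability at least $1/2$ over the sample $S$, the unique minimizer $\hat{w}_S$ of $F_S$ satisfies $F_S(\hat{w}_S) = 0$ yet $F(\hat{w}_S) - \min_{w} F(w) = \Omega(1)$.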

The result is strengthened to approximate ERMs: any ε‑ERM with ε = Θ(m^{-3/2}) also suffers Ω(1) excess risk. This is the first lower bound that applies to approximate ERMs with inverse‑polynomial accuracy, showing that even near‑optimal empirical solutions cannot guarantee generalization in linear dimension.
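For concreteness, the approximate notion used here can be written as (a standard definition; the notation is generic rather than the paper's):

```latex
w \ \text{is an}\ \varepsilon\text{-ERM}
\iff
F_S(w) \le \min_{w' \in \mathcal{W}} F_S(w') + \varepsilon,
\qquad \varepsilon = \Theta\!\left(m^{-3/2}\right),
```

and the strengthened result says that some such $w$ still incurs excess population risk $F(w) - \min_{w'} F(w') = \Omega(1)$.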

Building on the same construction, the authors analyze constrained (projected) Gradient Descent (GD). Classical analysis provides an optimization guarantee F_S(w_GD) − min F_S = O(η + 1/(ηT)) and a stability‑based generalization upper bound of O(η√T + ηT/m). The paper proves a new lower bound on the generalization error of GD: Ω(ηT/m^{1.5}). This bound holds for any choice of learning rate η and horizon T, and becomes tight when ηT = Θ(m√m), a regime in which GD's training error is negligible but its population error remains large. Importantly, the construction attaining this bound has dimension linear in the sample size rather than exponential, narrowing the gap between the best known upper bound O(ηT/m) and previous lower bounds.
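One way to see the stated regime is a back‑of‑the‑envelope check using the rates displayed above (an illustration, not a derivation from the paper; the specific choice η = m^{-1/2}, T = m² is one instance of ηT = Θ(m√m)):

```latex
% Take \eta = m^{-1/2},\ T = m^{2}, so that \eta T = m^{3/2} = m\sqrt{m}. Then
F_S(w_{\mathrm{GD}}) - \min_w F_S(w)
  = O\!\left(\eta + \tfrac{1}{\eta T}\right)
  = O\!\left(m^{-1/2} + m^{-3/2}\right)
  = O\!\left(\tfrac{1}{\sqrt{m}}\right),
\qquad\text{while}\qquad
\Omega\!\left(\tfrac{\eta T}{m^{3/2}}\right) = \Omega(1).
```

That is, the training error vanishes as m grows while the generalization lower bound stays constant, which is exactly the overfitting regime described above.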

The authors place their contributions in context with related work: Feldman’s exponential‑dimensional ERM failure, Shalev‑Shwartz et al.’s construction of a unique failing ERM, regularized ERM results, stability‑based analyses of GD, and recent lower bounds for GD in high dimensions. Their construction shows that the success of algorithms such as SGD, regularized ERM, or stable GD does not stem from any generic property of the empirical minimizer; rather, it relies on algorithmic bias or regularization that steers the solution away from the pathological ERM.

In the discussion, the paper emphasizes two key implications. First, worst‑case ERM guarantees are insufficient for over‑parameterized convex learning; additional mechanisms (implicit bias, regularization, early stopping) are essential for generalization. Second, even a basic first‑order method like GD can overfit if hyper‑parameters are not carefully calibrated, highlighting the practical importance of controlling η and T relative to the sample size. The authors suggest future work on tightening the GD lower bound, extending the analysis to stochastic GD variants, and exploring whether similar phenomena arise for other optimization schemes.
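The algorithm in question is standard projected GD with averaging. The following is a minimal generic sketch (my own illustration on a toy quadratic objective, not the paper's construction) that makes the two hyper‑parameters η and T explicit:

```python
import numpy as np

def projected_gd(grad_F_S, w0, eta, T, radius=1.0):
    """Constrained (projected) gradient descent on an empirical risk F_S.

    grad_F_S: function returning a (sub)gradient of F_S at w.
    eta, T:   learning rate and horizon -- the hyper-parameters whose joint
              growth relative to the sample size drives the overfitting regime.
    Returns the averaged iterate over the trajectory.
    """
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for _ in range(T):
        w = w - eta * grad_F_S(w)
        norm = np.linalg.norm(w)
        if norm > radius:                    # Euclidean projection onto the ball
            w = w * (radius / norm)
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)

# Toy illustration: quadratic risk whose unconstrained minimizer lies outside
# the unit ball, so the constraint is active at the solution.
target = np.array([2.0, 0.0])
grad = lambda w: w - target                  # gradient of 0.5 * ||w - target||^2
w_bar = projected_gd(grad, np.zeros(2), eta=0.1, T=500)
```

Here the averaged iterate converges to the projection of the unconstrained minimizer onto the unit ball; in the paper's hard instance, by contrast, the same procedure with large ηT lands on a point with small empirical risk but Ω(1) population risk.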

Overall, the paper resolves Feldman’s open problem, extends failure results to approximate ERMs, and delivers a novel, dimension‑linear lower bound for GD’s generalization error, thereby deepening our theoretical understanding of over‑parameterization, empirical risk minimization, and first‑order optimization in convex settings.

