Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression
Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of min-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares, and only in low dimensions. For ridge regression, we further analyze two popular model classes – the random-effects model and the spiked covariance model – demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we uncover that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real data over synthetic data. We study three additional settings: (i) where real data is fixed and fresh labels are not obtained at each iteration, (ii) where covariates vary across iterations but fresh real labels are available each time, and (iii) where covariates vary with time but only a fraction of them receive fresh real labels at each iteration. Across these diverse settings, we characterize when model collapse is inevitable and when synthetic data improves learning. We validate our theoretical results with extensive simulations.
💡 Research Summary
This paper investigates the phenomenon of model collapse, in which generative models degrade after repeatedly training on their own synthetic outputs, in the context of overparameterized linear regression. The authors consider an iterative learning scheme in which, at each iteration, fresh real responses are mixed with synthetic responses generated by the estimator from the previous iteration. A mixing weight $w\in(0,1)$ controls the proportion of real data. Two families of estimators are studied: the minimum-$\ell_2$-norm interpolator (the limit of ridge regression as the regularization parameter $\lambda\to0$) and ridge regression with a fixed $\lambda>0$.
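A minimal sketch of the two estimator families and the label-mixing step described above, assuming the synthetic labels enter through a convex combination $w\,y_{\text{real}} + (1-w)\,X\hat\beta_{t-1}$; the function names and the ridge normalization $n\lambda$ are illustrative conventions, not necessarily the paper's notation:

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum-l2-norm interpolator beta = X^+ y (the ridge limit lambda -> 0)."""
    return np.linalg.pinv(X) @ y

def ridge(X, y, lam):
    """Ridge estimator beta = (X^T X + n*lam*I)^{-1} X^T y.

    Note: the n*lam scaling is one common convention; the paper may normalize
    lambda differently.
    """
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def mixed_labels(y_real, X, beta_prev, w):
    """Training labels for one iteration: weight w on fresh real responses,
    weight 1-w on synthetic responses X @ beta_prev from the previous fit."""
    return w * y_real + (1.0 - w) * (X @ beta_prev)
```

In the overparameterized regime ($p > n$), `min_norm_interpolator` fits the mixed labels exactly, so the only control over synthetic-label feedback comes from the weight $w$.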
The setting assumes a high-dimensional regime where the number of features $p$ exceeds the number of samples $n$ and both grow proportionally ($p/n\to\gamma>1$). The design matrix is modeled as $X=Z\Sigma^{1/2}$, with the entries of $Z$ i.i.d. with zero mean, unit variance, and bounded higher moments; $\Sigma$ is a general covariance matrix with bounded eigenvalues. The noise is i.i.d. Gaussian with variance $\sigma^2$, and the signal strength $\|\beta\|_2^2$ converges to a finite constant $b^\star$.
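The data-generating model above can be sketched as follows; the specific dimensions, the spectrum of $\Sigma$, and the constants are hypothetical values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 400            # p/n = gamma = 2 > 1: overparameterized regime
sigma, b_star = 0.5, 1.0   # noise level and limiting signal strength (illustrative)

# A covariance with bounded eigenvalues (here: a mildly decaying spectrum).
eigvals = 1.0 + 1.0 / np.arange(1, p + 1)
Sigma_half = np.diag(np.sqrt(eigvals))

# Design X = Z Sigma^{1/2} with i.i.d. zero-mean, unit-variance entries in Z.
Z = rng.standard_normal((n, p))
X = Z @ Sigma_half

# Signal scaled so that ||beta||_2^2 = b_star, plus i.i.d. Gaussian noise.
beta = rng.standard_normal(p)
beta *= np.sqrt(b_star) / np.linalg.norm(beta)
y = X @ beta + sigma * rng.standard_normal(n)
```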
Main theoretical contributions
- Risk of the interpolator (Theorem 3.1).
As the number of iterations $t$ and the sample size $n$ tend to infinity (with $p/n\to\gamma$), the out-of-sample risk of the interpolator converges almost surely to a deterministic limit that depends on the mixing weight $w$, the aspect ratio $\gamma$, the noise variance $\sigma^2$, and the limiting signal strength $b^\star$.
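This long-run convergence can be illustrated numerically by iterating the mixing scheme with the min-$\ell_2$-norm interpolator on isotropic features; the parameter values, and the choice $w\approx 0.618$ (near the reciprocal of the golden ratio), are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T = 100, 200, 30     # gamma = p/n = 2; T iterations of the scheme
sigma, w = 0.5, 0.618      # noise level; mixing weight near 1/golden-ratio

beta_star = rng.standard_normal(p) / np.sqrt(p)  # ||beta_star||^2 ~ 1
X = rng.standard_normal((n, p))                  # isotropic features, fixed covariates
Xp = np.linalg.pinv(X)                           # pseudoinverse, reused each round

def risk(beta_hat):
    # Out-of-sample prediction risk E[(x^T (beta_hat - beta_star))^2];
    # for isotropic test points x this equals the squared parameter error.
    return float(np.sum((beta_hat - beta_star) ** 2))

beta_prev = np.zeros(p)
risks = []
for t in range(T):
    y_real = X @ beta_star + sigma * rng.standard_normal(n)  # fresh real labels
    y_mix = w * y_real + (1.0 - w) * (X @ beta_prev)         # mix with synthetic
    beta_prev = Xp @ y_mix                                   # min-norm interpolation
    risks.append(risk(beta_prev))
```

The sequence `risks` settles after a few iterations rather than diverging, consistent with a well-chosen $w$ preventing collapse in this toy run.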