Simmering: Sufficient is better than optimal for training neural networks
The broad range of neural network training techniques that invoke optimization but rely on ad hoc modification for validity suggests that optimization-based training is misguided. Shortcomings of optimization-based training are thrown into particularly sharp relief by the problem of overfitting, where naive optimization produces spurious outcomes. The broad success of neural networks for modelling physical processes has prompted advances that are based on inverting the direction of investigation and treating neural networks as if they were physical systems in their own right. These successes raise the question of whether broader, physical perspectives could motivate the construction of improved training algorithms. Here, we introduce simmering, a physics-based method that trains neural networks to generate weights and biases that are merely "good enough", but which, paradoxically, outperforms leading optimization-based approaches. Using classification and regression examples we show that simmering corrects neural networks that are overfit by Adam, and that it avoids overfitting if deployed from the outset. Our results question optimization as a paradigm for neural network training, and leverage information-geometric arguments to point to the existence of classes of sufficient training algorithms that do not take optimization as their starting point.
💡 Research Summary
The paper challenges the prevailing paradigm that neural network training should be framed as an optimization problem. While gradient‑based optimizers such as Adam are highly effective at reducing empirical loss, the authors argue that this very effectiveness is the root cause of over‑fitting, especially for over‑parameterized models that can represent many different parameter configurations with near‑identical training loss. In such settings, the optimizer typically converges to a single minimum that captures idiosyncrasies of the training data, leading to poor generalization on unseen test data.
To address this, the authors introduce “simmering,” a physics‑inspired training method that treats network weights and biases as particles in a thermodynamic system. They define a partition function Z(β, D)=∑ₓ exp(−β L(x,D)), where the sum runs over parameter configurations x, L(x, D) is the loss on dataset D, β is the Laplace‑transform variable, and T = 1/β plays the role of temperature. In the limit β→∞ (T→0) the method reduces to conventional loss minimization; for finite β the system samples from a Boltzmann‑like distribution over parameter space, thereby exploring a cloud of near‑optimal configurations rather than a single point.
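The effect of the temperature parameter can be illustrated with a toy one-parameter model. The sketch below (our construction, not the paper's code) discretizes the parameter axis, computes Boltzmann weights exp(−β L) over the candidate parameters, and shows that finite β spreads probability mass over near-optimal configurations while large β collapses it onto the loss minimum:

```python
import numpy as np

# Toy illustration of Boltzmann-weighted parameter ensembles.
# Model: y = w * x fit to noisy linear data; we scan candidate w values.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 0.1 * rng.standard_normal(20)

ws = np.linspace(0.0, 4.0, 401)                       # candidate parameters
losses = np.array([np.mean((w * x - y) ** 2) for w in ws])

def boltzmann_weights(losses, beta):
    """p(w) ∝ exp(-beta * L(w)); subtract the min loss for stability."""
    p = np.exp(-beta * (losses - losses.min()))
    return p / p.sum()

p_warm = boltzmann_weights(losses, beta=10.0)   # finite T: spread-out cloud
p_cold = boltzmann_weights(losses, beta=1e6)    # T -> 0: point estimate

entropy = lambda p: -(p * np.log(p + 1e-300)).sum()
print("warm entropy:", entropy(p_warm))
print("cold entropy:", entropy(p_cold))
print("cold mode matches argmin:", ws[p_cold.argmax()] == ws[losses.argmin()])
```

The warm ensemble has much higher entropy than the cold one, which is the distinction the paper exploits: sampling the cloud rather than the point.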
Implementation relies on Nosé‑Hoover chain thermostats from molecular dynamics. The weights and biases are augmented with auxiliary momenta, yielding a Hamiltonian system whose equations of motion are integrated with a symplectic scheme. By gradually raising the temperature (e.g., from T = 0 to T = 0.05 in loss units), the algorithm “simmers” the network, allowing it to wander among sloppy modes—directions in parameter space identified by small eigenvalues of the Fisher information matrix. These sloppy modes correspond to families of parameter sets that fit the noisy training data equally well but differ in how they capture underlying signal versus noise.
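A minimal sketch of this idea, using a single Nosé-Hoover thermostat rather than the chains the paper employs (chains improve ergodicity), applied to a toy quadratic loss with unit curvature. All step sizes, masses, and the loss itself are our illustrative assumptions:

```python
import numpy as np

def grad_loss(theta):
    """Gradient of the toy loss L(theta) = 0.5 * |theta|^2."""
    return theta

def nose_hoover(theta, T=0.05, Q=1.0, dt=1e-2, steps=200_000, seed=0):
    """Sample parameters at temperature T with one Nosé-Hoover thermostat.

    Equations of motion (unit masses):
        dtheta/dt = p
        dp/dt     = -grad L(theta) - xi * p
        dxi/dt    = (|p|^2 - n * T) / Q   # drives kinetic energy to n*T/2
    """
    rng = np.random.default_rng(seed)
    n = theta.size
    p = rng.standard_normal(n) * np.sqrt(T)   # momenta at temperature T
    xi = 0.0                                   # thermostat friction variable
    samples = []
    for step in range(steps):
        p -= dt * (grad_loss(theta) + xi * p)
        theta = theta + dt * p
        xi += dt * (p @ p - n * T) / Q
        if step > steps // 2:                  # discard burn-in
            samples.append(theta.copy())
    return np.array(samples)

samples = nose_hoover(np.zeros(4))
# For this unit-curvature loss, equilibrium gives <theta_i^2> ≈ T.
print(np.mean(samples**2, axis=0))
```

A single thermostat on a near-harmonic loss is known to mix poorly, which is precisely why the paper uses Nosé-Hoover chains; the sketch only conveys the structure of the dynamics.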
The authors evaluate simmering in two ways. First, they use it as a retro‑fitting step: a network trained to over‑fit with Adam serves as the initial condition, after which simmering is applied. On a synthetic sinusoidal regression task, the Adam‑trained model exhibits a clear divergence between training and test loss and a visibly distorted fit. Simmering at modest temperature produces an ensemble of models whose averaged prediction closely matches the true sinusoid, eliminating the over‑fit distortion. Similar retro‑fitting experiments on MNIST image classification, HIGGS event classification, IRIS species classification, and automotive‑MPG regression all show improved test accuracy or R² after simmering.
Second, they train networks “ab initio” with simmering, i.e., without any prior optimization. In these experiments the algorithm directly samples an ensemble of models at finite temperature, yielding smooth decision boundaries for classification and well‑behaved prediction intervals for regression. Because the ensemble reflects the maximum‑entropy distribution consistent with the loss, it naturally provides calibrated uncertainty estimates, a feature absent from single‑point optimizers.
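Once simmering has produced an ensemble of parameter samples, turning it into a mean prediction and a pointwise uncertainty band is mechanical. The sketch below (our illustration, with a synthetic Gaussian "ensemble" standing in for actual simmering samples of a toy linear model y = w·x + b) shows the pattern:

```python
import numpy as np

# Stand-in ensemble of (w, b) samples; in practice these would come from
# the finite-temperature sampler rather than a Gaussian.
rng = np.random.default_rng(1)
ensemble = rng.normal(loc=[2.0, 0.5], scale=0.1, size=(500, 2))

x = np.linspace(0.0, 1.0, 50)
preds = ensemble[:, :1] * x + ensemble[:, 1:]     # shape (500, 50)

mean_pred = preds.mean(axis=0)                    # ensemble-averaged prediction
std_pred = preds.std(axis=0)                      # pointwise uncertainty
lower, upper = mean_pred - 2 * std_pred, mean_pred + 2 * std_pred

print("prediction at x=0:", mean_pred[0], "+/-", std_pred[0])
```

The ±2σ band is what a single-point optimizer like Adam cannot provide: one weight vector yields one curve, with no spread to report.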
From an information‑geometric perspective, the paper explains that over‑parameterization creates manifolds of near‑equivalent minima connected by sloppy directions. Traditional optimizers lock onto one point on this manifold, whereas simmering, by virtue of its temperature parameter, reduces the effective distance between optimal and near‑optimal points, encouraging exploration of the manifold. This exploration leads to parameter configurations that are less entangled with training‑set noise, thereby improving generalization. The authors also note that other thermostats (Langevin, Andersen) could replace Nosé‑Hoover, suggesting a broader class of “sufficient‑training” algorithms that prioritize adequate performance over strict optimality.
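The "sloppy directions" can be made concrete by computing the eigenvalue spectrum of an empirical Fisher information matrix. The sketch below (our construction) uses a classic sloppy model, a sum of two exponentials with nearly degenerate decay rates, and shows its FIM eigenvalues spanning many decades; the small-eigenvalue eigenvectors are the sloppy directions along which simmering can wander cheaply:

```python
import numpy as np

# Toy sloppy model: y(t) = a1*exp(-k1*t) + a2*exp(-k2*t),
# with nearly degenerate rates k1 ≈ k2.
t = np.linspace(0.0, 5.0, 100)
a1, k1, a2, k2 = 1.0, 1.0, 0.5, 1.2

# Jacobian of the model output with respect to (a1, k1, a2, k2).
J = np.stack([
    np.exp(-k1 * t),
    -a1 * t * np.exp(-k1 * t),
    np.exp(-k2 * t),
    -a2 * t * np.exp(-k2 * t),
], axis=1)

fim = J.T @ J                       # Gauss-Newton / empirical Fisher matrix
eigvals = np.linalg.eigvalsh(fim)   # ascending order
print("FIM eigenvalues:", eigvals)
print("condition spread:", eigvals[-1] / max(eigvals[0], 1e-30))
```

The enormous spread between stiff and sloppy eigenvalues is what makes whole manifolds of parameters nearly equivalent in loss.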
Overall, the paper makes three key contributions: (1) a rigorous critique of optimization‑centric training and its link to over‑fitting; (2) the formulation of a thermodynamic, temperature‑controlled training algorithm that leverages molecular‑dynamics tools to sample ensembles of sufficiently good models; and (3) empirical evidence across multiple domains that both retro‑fitting and pure simmering outperform Adam in terms of test accuracy and provide meaningful uncertainty quantification. By reframing neural network training as a statistical‑physics sampling problem rather than a deterministic minimization task, the work opens a new avenue for developing robust, generalizable deep learning systems.