Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

For four decades statistical physics has been providing a framework to analyse neural networks. A long-standing question remained on its capacity to tackle deep learning models capturing rich feature learning effects, thus going beyond the narrow networks or kernel methods analysed until now. We positively answer through the study of the supervised learning of a multi-layer perceptron. Importantly, (i) its width scales as the input dimension, making it more prone to feature learning than ultra wide networks, and more expressive than narrow ones or ones with fixed embedding layers; and (ii) we focus on the challenging interpolation regime where the number of trainable parameters and data are comparable, which forces the model to adapt to the task. We consider the matched teacher-student setting. Therefore, we provide the fundamental limits of learning random deep neural network targets and identify the sufficient statistics describing what is learnt by an optimally trained network as the data budget increases. A rich phenomenology emerges with various learning transitions. With enough data, optimal performance is attained through the model’s “specialisation” towards the target, but it can be hard to reach for training algorithms which get attracted by sub-optimal solutions predicted by the theory. Specialisation occurs inhomogeneously across layers, propagating from shallow towards deep ones, but also across neurons in each layer. Furthermore, deeper targets are harder to learn. Despite its simplicity, the Bayes-optimal setting provides insights on how the depth, non-linearity and finite (proportional) width influence neural networks in the feature learning regime that are potentially relevant in much more general settings.


💡 Research Summary

The paper revisits the long‑standing question of whether statistical‑physics methods can be extended to deep learning models that exhibit genuine feature‑learning, rather than merely reproducing kernel‑like behavior. To this end the authors study a fully‑connected multi‑layer perceptron (MLP) whose hidden‑layer widths scale linearly with the input dimension d (kₗ = Θ(d)). This “proportional‑width” regime sits between the ultra‑wide limit (where the network linearizes and is described by the Neural‑Tangent‑Kernel) and the classic narrow‑network limit (where expressive power is severely limited).

The learning scenario is the matched teacher‑student setting. A teacher network of the same architecture, with random Gaussian weights, generates labels y = f_teacher(x) for inputs x drawn from a standard normal distribution. The student network is trained on n examples, with n comparable to the total number of trainable parameters P; since the widths scale as Θ(d), each weight matrix has Θ(d²) entries, so both n and P are of order Θ(d²). This “interpolation” regime forces the model to adapt to the task rather than simply over‑parameterize it.
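The data-generating process can be sketched as follows. This is a minimal NumPy illustration; the depth, the exact widths, and the tanh activations are our illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 100               # input dimension
widths = [d, d, 1]    # hidden widths of order Theta(d); scalar output
n = 5000              # number of training examples


def teacher_forward(x, weights):
    """Propagate inputs through the random teacher network."""
    h = x
    for l, W in enumerate(weights):
        h = h @ W.T / np.sqrt(W.shape[1])  # 1/sqrt(fan-in) scaling
        if l < len(weights) - 1:
            h = np.tanh(h)                 # illustrative non-linearity
    return h


# Random Gaussian teacher weights, one matrix per layer.
dims = [d] + widths
teacher = [rng.standard_normal((dims[l + 1], dims[l]))
           for l in range(len(widths))]

X = rng.standard_normal((n, d))   # inputs x ~ N(0, I_d)
y = teacher_forward(X, teacher)   # labels y = f_teacher(x)
print(X.shape, y.shape)           # (5000, 100) (5000, 1)
```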

Using replica theory and the computation of the thermodynamic free energy, the authors derive the Bayes‑optimal (or “teacher‑matching”) posterior and identify the sufficient statistics that fully describe the learned representation as the data budget grows. The analysis reveals a rich phenomenology of learning transitions:
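In replica analyses of teacher‑student networks, the sufficient statistics are typically layer‑wise overlap (order‑parameter) matrices between student and teacher weights. A schematic form (the notation here is ours, not necessarily the paper's) is:

```latex
% Overlap between the student's layer-\ell weights W^{(\ell)}
% and the teacher's W^{*(\ell)}:
Q^{(\ell)} \;=\; \frac{1}{d}\, W^{(\ell)} \bigl(W^{*(\ell)}\bigr)^{\!\top},
\qquad \ell = 1, \dots, L.
```

Roughly, Q^{(ℓ)} ≈ 0 corresponds to an uninformed (“ignorant”) layer, while Q^{(ℓ)} approaching a (signed) permutation matrix corresponds to neuron‑by‑neuron specialisation towards the teacher.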

  1. Ignorant phase (few data). When the sample ratio α = n/d² is small, the free‑energy landscape has a single global minimum corresponding to a random, uninformative configuration. The mean‑squared error (MSE) remains high and the network does not extract any structure from the inputs.

  2. Partial specialization. Once the sample ratio α = n/d² exceeds a first critical value α₁, the shallow layers begin to align with the teacher’s internal representations. This creates new local minima in the free‑energy landscape, and the MSE drops sharply. However, deeper layers remain largely random, leading to an inhomogeneous learning pattern across depth.

  3. Full specialization. At a second, larger critical ratio α₂ > α₁, the specialization propagates through the network. All layers and most neurons become correlated with the teacher’s weights, and the MSE reaches the Bayes‑optimal value. The transition is gradual and proceeds neuron‑by‑neuron: within a given layer, a subset of “core” neurons aligns early, while “auxiliary” neurons lag behind.

The specialization process is thus layer‑wise and neuron‑wise heterogeneous. The authors term this “propagated feature learning.” Moreover, the depth of the teacher network dramatically affects the required data: deeper teachers shift both α₁ and α₂ upward, confirming the intuitive notion that “deeper targets are harder to learn.”
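The layer‑ and neuron‑wise specialisation described above could be quantified empirically with an overlap measure. The sketch below uses a permutation‑ and sign‑insensitive cosine overlap; this is our choice of diagnostic, not necessarily the paper's exact order parameter:

```python
import numpy as np

def layer_overlap(W_student, W_teacher):
    """Permutation- and sign-insensitive alignment between weight matrices.

    Each row is a neuron's incoming weight vector; every teacher neuron is
    matched to its best-aligned student neuron, and the |cosine| similarities
    of the matches are averaged.
    """
    Ws = W_student / np.linalg.norm(W_student, axis=1, keepdims=True)
    Wt = W_teacher / np.linalg.norm(W_teacher, axis=1, keepdims=True)
    C = np.abs(Wt @ Ws.T)          # |cosine| between all neuron pairs
    return C.max(axis=1).mean()    # best-match overlap per teacher neuron


rng = np.random.default_rng(1)
W_teacher = rng.standard_normal((50, 100))

# A fully specialised student layer: teacher rows, permuted and sign-flipped.
perm = rng.permutation(50)
signs = rng.choice([-1.0, 1.0], size=(50, 1))
W_specialised = signs * W_teacher[perm]

W_random = rng.standard_normal((50, 100))   # an "ignorant" student layer

print(layer_overlap(W_specialised, W_teacher))  # ~1.0 (perfect alignment)
print(layer_overlap(W_random, W_teacher))       # small, of order 1/sqrt(d)
```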

From an algorithmic standpoint, the paper shows that standard stochastic gradient descent (SGD) or adaptive methods such as Adam typically get trapped in the intermediate local minima rather than reaching the Bayes‑optimal state. This “algorithmic trap” is especially pronounced in the interpolation regime where the landscape is rugged. The authors suggest practical mitigations: (i) over‑training the shallow layers early on, (ii) layer‑wise regularization or initialization that mimics the teacher’s statistics, and (iii) data augmentation to effectively increase the α ratios.
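As a hedged illustration of the kind of experiment this refers to — not the authors' actual protocol — one could train a one‑hidden‑layer student on teacher‑generated data with plain full‑batch gradient descent and monitor the error (all sizes, the learning rate, and the fixed readout are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 50, 50, 2000      # input dim, hidden width of order d, samples
lr, steps = 0.2, 300        # plain full-batch gradient descent

# One-hidden-layer teacher with a fixed readout vector, for simplicity.
Wt = rng.standard_normal((k, d))
a = np.ones(k)

def forward(W, X):
    h = np.tanh(X @ W.T / np.sqrt(d))   # (n, k) hidden activations
    return h @ a / np.sqrt(k), h        # scalar output per example

X = rng.standard_normal((n, d))
y, _ = forward(Wt, X)                   # teacher labels
Xte = rng.standard_normal((n, d))
yte, _ = forward(Wt, Xte)               # held-out test set

W = rng.standard_normal((k, d))         # random student initialisation
mse_init = np.mean((forward(W, X)[0] - y) ** 2)
for _ in range(steps):
    pred, h = forward(W, X)
    err = pred - y                      # (n,)
    # Hand-written backprop through the fixed readout and the tanh.
    dpre = (err[:, None] * a[None, :] / np.sqrt(k)) * (1 - h ** 2)
    W -= lr * dpre.T @ X / (np.sqrt(d) * n)

mse_final = np.mean((forward(W, X)[0] - y) ** 2)
test_mse = np.mean((forward(W, Xte)[0] - yte) ** 2)
print(mse_init, mse_final, test_mse)
```

Whether such a run reaches the Bayes‑optimal error or stalls at a sub‑optimal plateau would depend on the sample ratio and the phase it falls in, which is precisely the trapping phenomenon discussed above.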

In summary, the work demonstrates that an MLP with widths proportional to the input dimension, trained near the interpolation threshold, exhibits genuine feature learning that cannot be captured by kernel theories. Depth, non‑linearity, and finite proportional width jointly shape a sequence of learning phase transitions, culminating in a specialization process that mirrors the hierarchical feature extraction observed in modern deep networks. By providing exact Bayes‑optimal limits and a detailed phase diagram, the paper bridges the gap between earlier narrow‑network spin‑glass analyses and the ultra‑wide NTK regime, offering a solid theoretical foundation for understanding feature learning in realistic deep learning models.

