Lecture notes: From Gaussian processes to feature learning
These lecture notes develop the theory of learning in deep and recurrent neuronal networks from the point of view of Bayesian inference. The aim is to enable the reader to understand typical computations found in the literature in this field. Initial chapters develop the theoretical tools, such as probabilities, moment- and cumulant-generating functions, and some notions of large-deviation theory, as far as they are needed to understand collective network behavior with large numbers of parameters. The main part of the notes derives the theory of Bayesian inference for deep and recurrent networks, starting with the neural network Gaussian process (lazy-learning) limit, which is subsequently extended to study feature learning from the point of view of adaptive kernels. The notes also expose the link between the adaptive-kernel approach and kernel-rescaling approaches.
💡 Research Summary
These lecture notes present a comprehensive, physics‑inspired framework for understanding learning in deep feed‑forward and recurrent neural networks from a Bayesian perspective. The authors begin by motivating the need for a principled theory of artificial neural networks (ANNs), noting that current design practices rely heavily on intuition and trial‑and‑error, which is costly in terms of computation and energy. They then outline how statistical physics—particularly concepts from disordered systems and large‑deviation theory—can be transplanted to the study of ANNs.
The first technical chapters (2–3) introduce the mathematical toolbox required for the rest of the manuscript. Chapter 2 reviews probability theory, observables, moments, cumulants, transformations of random variables, joint and conditional distributions, and the relationship between moments and cumulants. Chapter 3 focuses on the Gaussian distribution, deriving its moment‑generating and cumulant‑generating functions and proving Wick’s theorem, which will later allow the authors to handle high‑order correlations in Gaussian fields efficiently.
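Wick's theorem, as used in these chapters, states that every even moment of a centered Gaussian is a sum of products of covariances over all pairings. A minimal numerical sanity check (the variance value is arbitrary, chosen only for illustration) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5  # illustrative standard deviation
x = rng.normal(0.0, sigma, size=2_000_000)

# Wick's theorem for a centered Gaussian: each even moment is a sum over
# pairings of the covariance, e.g. <x^4> = 3 sigma^4 (3 pairings) and
# <x^6> = 15 sigma^6 (15 pairings)
m4, m6 = np.mean(x**4), np.mean(x**6)
wick4, wick6 = 3 * sigma**4, 15 * sigma**6
print(m4, wick4)
print(m6, wick6)
```

The empirical fourth and sixth moments agree with the pairing counts 3 and 15 up to Monte Carlo error, which is the mechanism that later lets Gaussian-field correlations of any order be reduced to the covariance.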
Chapter 4 recasts linear regression in a Bayesian setting. With a Gaussian prior on the weights, the posterior remains Gaussian, and the authors derive the familiar bias‑variance decomposition and show how the Bayesian predictor coincides with the ridge‑regression solution. This simple example serves as a pedagogical bridge to more complex, non‑linear models.
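The ridge/posterior correspondence described above can be verified in a few lines. In this sketch the noise scale, prior variance, and data are illustrative; the ridge penalty is set to the ratio of noise to prior variance, which is the identification the Bayesian derivation yields:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])      # illustrative ground-truth weights
sigma_noise = 0.3                         # assumed observation-noise scale
y = X @ w_true + sigma_noise * rng.normal(size=n)

sigma_prior = 1.0                         # Gaussian prior N(0, sigma_prior^2 I)
lam = sigma_noise**2 / sigma_prior**2     # equivalent ridge penalty

# Posterior mean of Bayesian linear regression with Gaussian prior/likelihood
w_post = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ridge solution computed independently, via the augmented least-squares system
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
w_ridge = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
```

Both routes give the same weight vector: the posterior mean under a Gaussian prior is exactly the ridge estimator with penalty `sigma_noise**2 / sigma_prior**2`.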
Chapter 5 introduces the law of large numbers in its stronger form—the large‑deviation principle—via the Gärtner‑Ellis theorem. By relating the cumulant‑generating function to a Legendre transform, the authors obtain a “free‑energy” functional that governs the macroscopic behavior of a system with many degrees of freedom. This machinery is later used to approximate the distribution of network parameters when the width tends to infinity.
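The Legendre-transform structure of the Gärtner-Ellis theorem can be checked numerically for the simplest case. For i.i.d. Gaussian samples the scaled cumulant-generating function is C(k) = μk + σ²k²/2, and the theorem predicts the rate function I(x) = (x − μ)²/(2σ²); the sketch below (parameter values illustrative) computes the Legendre transform on a grid and compares:

```python
import numpy as np

mu, sigma = 0.5, 2.0  # illustrative mean and standard deviation
# Scaled cumulant-generating function of a Gaussian sample mean
k = np.linspace(-10, 10, 200_001)
C = mu * k + 0.5 * sigma**2 * k**2

def rate(x):
    """Numerical Legendre transform I(x) = sup_k (k x - C(k))."""
    return np.max(k * x - C)

# Gaertner-Ellis prediction: the Gaussian rate function (x - mu)^2 / (2 sigma^2)
xs = np.linspace(-3, 4, 15)
numeric = np.array([rate(x) for x in xs])
exact = (xs - mu)**2 / (2 * sigma**2)
```

The grid-based supremum reproduces the closed-form rate function to numerical precision, illustrating how the cumulant-generating function acts as the "free energy" whose Legendre transform governs large deviations.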
The core of the notes begins with Chapter 6, where the Neural Network Gaussian Process (NNGP) is derived. Starting from a single‑hidden‑layer perceptron, the authors employ a field‑theoretic approach to show that, in the infinite‑width limit, the pre‑activations become independent Gaussian fields. The mean and covariance of the network output are expressed through a kernel that recursively composes across layers, yielding the well‑known “lazy learning” regime: the weight distribution after training remains essentially the same as at initialization, and learning is captured entirely by the evolution of the kernel.
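The NNGP statement for a single hidden layer can be probed with a small Monte Carlo experiment. The sketch below uses tanh activations and variance-scaled Gaussian weights (all choices illustrative, not the notes' specific setup): the output covariance over an ensemble of random networks is compared against the kernel obtained by averaging the activation over the limiting Gaussian pre-activation field:

```python
import numpy as np

rng = np.random.default_rng(2)
d, width, nets = 4, 1000, 10_000
x1 = np.ones(d)
x2 = np.array([1.0, 1.0, -1.0, 1.0])   # overlap x1.x2/d = 0.5 with x1
phi = np.tanh                           # illustrative activation
sw2, sv2 = 1.0, 1.0                     # weight variances sw2/d and sv2/width

# Empirical output covariance over an ensemble of random finite-width networks
z1 = np.empty(nets); z2 = np.empty(nets)
for t in range(nets):
    W = rng.normal(0.0, np.sqrt(sw2 / d), size=(width, d))
    v = rng.normal(0.0, np.sqrt(sv2 / width), size=width)
    z1[t] = v @ phi(W @ x1)
    z2[t] = v @ phi(W @ x2)
emp = np.mean(z1 * z2)

# NNGP kernel: sv2 * E[phi(u1) phi(u2)] with (u1, u2) Gaussian pre-activations
# whose covariance is sw2 * (x_a . x_b) / d
C = sw2 * np.array([[x1 @ x1, x1 @ x2], [x1 @ x2, x2 @ x2]]) / d
u = rng.multivariate_normal([0.0, 0.0], C, size=1_000_000)
nngp = sv2 * np.mean(phi(u[:, 0]) * phi(u[:, 1]))
```

The two estimates agree: the network's output statistics are fully captured by a kernel built from Gaussian pre-activations, which is precisely the composition rule that recurses across layers in deeper networks.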
Chapter 7 extends the same analysis to recurrent neural networks (RNNs). By treating time as an additional dimension in the field theory, the authors demonstrate that wide RNNs also converge to a Gaussian process. They discuss the emergence of a chaos transition and depth‑scale phenomena, and they use large‑deviation arguments to delineate stable fixed points from chaotic regimes.
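The chaos transition mentioned here can be visualized with a toy discrete-time tanh network, h ↦ tanh(W h), with coupling gain g (the gains 0.5 and 3.0 and the i.i.d. Gaussian couplings are illustrative choices, not the notes' exact model): below the transition an infinitesimal perturbation of the state dies out, above it the perturbation grows until the two trajectories decorrelate.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, eps = 1000, 100, 1e-6  # network size, time steps, perturbation size

def final_distance(g):
    """Distance after T steps between two eps-close trajectories of h -> tanh(W h)."""
    W = rng.normal(0.0, g / np.sqrt(N), size=(N, N))  # coupling gain g
    h = rng.normal(size=N)
    h_pert = h + eps * rng.normal(size=N) / np.sqrt(N)
    for _ in range(T):
        h = np.tanh(W @ h)
        h_pert = np.tanh(W @ h_pert)
    return np.linalg.norm(h - h_pert)

d_stable = final_distance(0.5)   # below the transition: perturbations contract
d_chaotic = final_distance(3.0)  # above it: perturbations grow exponentially
```

The stable run ends with the trajectories essentially identical, while the chaotic run separates by many orders of magnitude; the large-deviation analysis in the chapter makes this boundary between regimes precise.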
Chapter 8 derives the Fokker‑Planck equation for the probability density of network parameters under stochastic gradient descent (SGD) interpreted as Langevin dynamics. The stationary solution of this equation coincides with the Bayesian posterior obtained from the prior and the likelihood, establishing a formal link between training dynamics and Bayesian inference. The authors illustrate the theory with the Ornstein‑Uhlenbeck process and show how gradient descent on linear regression can be understood as a Bayesian update.
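The link between Langevin training dynamics and the Bayesian posterior can be checked on the linear-regression example. The sketch below (step size, variances, and iteration counts all illustrative) runs a Euler-discretized Langevin process on U(w) = −log[prior × likelihood] and compares its long-time average against the exact posterior mean, which is the stationary solution of the corresponding Fokker-Planck equation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 40, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0]) + 0.5 * rng.normal(size=n)

s2, sp2 = 0.25, 1.0                     # noise and prior variances (illustrative)
A = X.T @ X / s2 + np.eye(d) / sp2      # posterior precision matrix
b = X.T @ y / s2
mu = np.linalg.solve(A, b)              # exact posterior mean

# Euler discretization of overdamped Langevin dynamics dw = -grad U dt + sqrt(2) dW;
# for this quadratic U the stationary density is the Gaussian posterior
eta, steps, burn = 1e-3, 120_000, 20_000
w = np.zeros(d)
acc = np.zeros(d)
for k in range(steps):
    w = w - eta * (A @ w - b) + np.sqrt(2 * eta) * rng.normal(size=d)
    if k >= burn:
        acc += w
emp_mean = acc / (steps - burn)
```

After discarding a burn-in, the time average of the noisy gradient dynamics matches the posterior mean, illustrating how SGD-as-Langevin samples the Bayesian posterior rather than merely minimizing the loss.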
Chapter 9 tackles the central limitation of the NNGP: it is label‑agnostic and therefore cannot capture feature learning. Two complementary approaches are presented. The first is a kernel‑scaling perspective, where a scalar "scale" parameter evolves with the amount of data, effectively rescaling the kernel as more samples are observed. The second is an adaptive‑kernel framework in which the kernel itself is treated as a dynamical variable that updates together with the weights, allowing the network to develop data‑dependent representations even in the infinite‑width limit. The authors connect these viewpoints, discuss their relationship to the Neural Tangent Kernel (NTK), and provide numerical experiments that validate the theoretical predictions.
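As a generic illustration of the kernel-rescaling idea (not the notes' specific scheme: the RBF base kernel, noise level, toy data, and grid search are all assumptions made for the sketch), one can let the data choose a scalar kernel scale by maximizing the Gaussian-process marginal likelihood:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = np.linspace(-2, 2, n)
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)   # illustrative toy data

K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)  # fixed base kernel (RBF, illustrative)
noise = 0.1**2

def log_evidence(s):
    """Log marginal likelihood of the GP with rescaled kernel s * K."""
    C = s * K + noise * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + n * np.log(2 * np.pi))

# Grid search over the scalar rescaling factor
scales = np.linspace(0.05, 5.0, 100)
evidences = [log_evidence(s) for s in scales]
best = scales[int(np.argmax(evidences))]
```

The selected scale depends on the observed labels, which is exactly the qualitative ingredient the plain NNGP lacks; the adaptive-kernel framework of the chapter generalizes this from a single scalar to the full kernel.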
The notes conclude with a nomenclature appendix and several exercises that reinforce the material. Throughout, the authors emphasize that the presented methods yield average‑case, rather than worst‑case, results, aligning with the statistical‑physics tradition of focusing on typical behavior in high‑dimensional systems. By integrating Bayesian inference, large‑deviation theory, field theory, and stochastic dynamics, the manuscript offers a unified, mathematically rigorous picture of how deep and recurrent networks transition from lazy learning to genuine feature learning, thereby furnishing researchers and engineers with tools to predict generalization performance, sample complexity, and the impact of architectural choices before costly training is performed.