Amortising Inference and Meta-Learning Priors in Neural Networks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

One of the core facets of Bayesianism is the updating of prior beliefs in light of new evidence; so how can we maintain a Bayesian approach if we have no prior beliefs in the first place? This is one of the central challenges in Bayesian deep learning, where it is not clear how to represent beliefs about a prediction task as prior distributions over model parameters. Bridging Bayesian deep learning and probabilistic meta-learning, we introduce a way to learn a weights prior from a collection of datasets via per-dataset amortised variational inference. The model we develop can be viewed as a neural process whose latent variable is the set of weights of a BNN and whose decoder is the neural network parameterised by a sample of that latent variable. This model allows us to study the behaviour of Bayesian neural networks under well-specified priors, to use Bayesian neural networks as flexible generative models, and to perform desirable but previously elusive feats in neural processes, such as within-task minibatching and meta-learning under extreme data starvation.


💡 Research Summary

The paper tackles a fundamental problem in Bayesian deep learning: how to obtain meaningful priors over neural network weights when no explicit prior knowledge is available. By merging Bayesian deep learning with probabilistic meta‑learning, the authors introduce the Bayesian Neural Network Process (BNNP), a novel member of the Neural Process family whose latent variable is the full set of weights of a Bayesian neural network (BNN) and whose decoder is the BNN itself.

The core technical contribution is an “amortised linear layer”. For each layer ℓ of an L‑layer MLP, a factorised Gaussian prior pψℓ(Wℓ) is assumed. The authors construct pseudo‑likelihood terms for the weights of that layer, whose parameters (pseudo‑observations and noise levels) are produced by an inference network gθℓ that processes each input‑output pair independently. Because the pseudo‑likelihood is Gaussian, the posterior over Wℓ given the pseudo‑data and the prior can be computed analytically via Bayesian linear regression. This yields a closed‑form variational posterior q(Wℓ|W1:ℓ−1,D) that depends on the posterior of the previous layers, enabling a fully Bayesian, layer‑wise amortised inference scheme. Stacking these layers gives a global variational posterior q(W|D)=∏ℓq(Wℓ|W1:ℓ−1,D).
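The layer-wise posterior described above can be sketched as ordinary conjugate Bayesian linear regression. The snippet below is an illustrative reconstruction, not the paper's code: function and argument names are our own, and it treats one output unit of one layer, with the pseudo-observations and per-point precisions standing in for the outputs of the inference network gθℓ.

```python
import numpy as np

def amortised_layer_posterior(phi, pseudo_t, pseudo_prec, prior_mean, prior_var):
    """Closed-form posterior over one output unit's weights of a single layer.

    phi:         (n, d) layer inputs acting as regression features
    pseudo_t:    (n,)   pseudo-observations (inference-network output)
    pseudo_prec: (n,)   per-point noise precisions (inference-network output)
    prior_mean, prior_var: (d,) factorised Gaussian prior over the weights

    Because the pseudo-likelihood is Gaussian, the posterior is Gaussian
    with parameters given by standard Bayesian linear regression.
    """
    # Posterior precision: prior precision plus the data term
    prec = np.diag(1.0 / prior_var) + phi.T @ np.diag(pseudo_prec) @ phi
    cov = np.linalg.inv(prec)
    # Posterior mean combines the prior and the precision-weighted pseudo-data
    mean = cov @ (prior_mean / prior_var + phi.T @ (pseudo_prec * pseudo_t))
    return mean, cov
```

With the pseudo-precisions driven to zero the posterior collapses back to the prior, which is one way to check the algebra.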

Training is driven by a new objective called Posterior‑Predictive Amortised Variational Inference (PP‑AVI). The loss combines (i) a log posterior‑predictive term over target points (the standard Neural Process predictive loss) and (ii) the usual ELBO for Bayesian neural networks. The first term encourages accurate predictions, while the second enforces a high‑quality approximation of the true posterior. The authors prove that, in the limit of infinitely many meta‑tasks, maximising this objective simultaneously satisfies three desiderata: accurate posterior approximation, a prior that captures the underlying data‑generating process, and strong predictive performance.
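Schematically, and under our reading of the summary (the paper's exact weighting and conditioning sets may differ), the objective has the form

```latex
\mathcal{L}(\theta,\psi) \;=\; \mathbb{E}_{\mathcal{D}}\Big[
  \underbrace{\mathbb{E}_{q_\theta(W \mid \mathcal{D}_c)}\big[\log p(y_t \mid x_t, W)\big]}_{\text{posterior-predictive term}}
  \;+\;
  \underbrace{\mathbb{E}_{q_\theta(W \mid \mathcal{D})}\big[\log p(\mathcal{D} \mid W)\big]
    - \mathrm{KL}\big(q_\theta(W \mid \mathcal{D}) \,\|\, p_\psi(W)\big)}_{\text{ELBO term}}
\Big],
```

where $\mathcal{D}_c$ denotes the context set, $(x_t, y_t)$ the target points, and the outer expectation runs over meta-tasks. Maximising the first term tunes predictions; maximising the second pulls $q_\theta$ toward the true posterior under the learned prior $p_\psi$.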

A practical challenge for Neural Processes is the memory cost of storing all context points during inference. The authors solve this by introducing within‑task minibatching: context data are split into minibatches, and each layer’s posterior is updated sequentially using Bayesian updates that discard the minibatch after incorporation. This yields the exact same posterior as full‑batch inference while dramatically reducing memory usage.
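Because each layer's posterior is conjugate Gaussian, folding in minibatches sequentially accumulates the same natural parameters as processing all context points at once, which is why minibatching is exact. The sketch below illustrates this equivalence with a generic Gaussian linear-regression update; names and the fixed noise precision are our own simplifications, not the paper's implementation.

```python
import numpy as np

def bayes_update(prec, prec_mean, phi, t, noise_prec):
    """Fold one minibatch into the running posterior, in natural parameters
    (precision matrix and precision-times-mean), then discard the batch."""
    prec = prec + noise_prec * phi.T @ phi
    prec_mean = prec_mean + noise_prec * phi.T @ t
    return prec, prec_mean

rng = np.random.default_rng(0)
phi, t = rng.normal(size=(12, 3)), rng.normal(size=12)

# Full-batch posterior (identity prior precision, zero prior mean)
prec_f, pm_f = bayes_update(np.eye(3), np.zeros(3), phi, t, 2.0)

# Same data streamed in three minibatches of four
prec_s, pm_s = np.eye(3), np.zeros(3)
for lo in range(0, 12, 4):
    prec_s, pm_s = bayes_update(prec_s, pm_s, phi[lo:lo + 4], t[lo:lo + 4], 2.0)

# Sequential updates recover the full-batch posterior exactly
assert np.allclose(prec_f, prec_s) and np.allclose(pm_f, pm_s)
```

Memory now scales with the minibatch size rather than the full context set, since only the running natural parameters are retained between updates.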

Because the prior parameters Ψ and the inference‑network parameters Θ are disentangled, the model can control the flexibility of the learned prior. By fixing the prior over a subset of layers (e.g., a zero‑mean unit‑variance Gaussian for the final layer) and learning only the remaining prior parameters, practitioners can prevent over‑fitting when only a few meta‑tasks are available.
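In a typical autodiff framework, fixing the prior over a subset of layers amounts to excluding those prior parameters from the optimiser. The snippet below is a hypothetical PyTorch sketch with an invented per-layer parameterisation (a mean and log-variance vector per layer); it is not the paper's code.

```python
import torch

# Hypothetical prior parameterisation: per-layer factorised Gaussian,
# stored as a mean and log-variance vector for each of three layers.
prior_params = [
    {"mean": torch.zeros(16, requires_grad=True),
     "log_var": torch.zeros(16, requires_grad=True)}
    for _ in range(3)
]

# Fix the final layer's prior at N(0, I): mark its parameters as
# non-trainable so only the remaining prior parameters are learned.
for p in (prior_params[-1]["mean"], prior_params[-1]["log_var"]):
    p.requires_grad_(False)

trainable = [p for layer in prior_params
             for p in layer.values() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-3)
```

Restricting the learnable prior in this way is what lets practitioners trade prior flexibility against over-fitting when few meta-tasks are available.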

The paper also discusses extensions. An attention‑based encoder (AttBNNP) can replace the per‑sample inference networks with transformers, allowing the model to capture interactions among context points at the cost of O(n_c²) computation and a slight deviation from exact minibatch updates. Alternatively, attention can be placed in the decoder, leading to a Bayesian Neural Attentive Machine (BNAM). However, because the decoder then makes target predictions dependent on other targets, the resulting model is no longer a consistent stochastic process.

Experiments on synthetic regression, image classification, and extreme data‑starvation meta‑learning scenarios demonstrate that BNNP outperforms GP‑based Neural Processes in both ELBO and predictive accuracy, that within‑task minibatching scales to large context sets without loss of performance, and that prior‑flexibility control mitigates over‑fitting in low‑task regimes.

In summary, the work presents a principled, scalable framework for learning and amortising Bayesian neural network priors across tasks. By providing closed‑form layer‑wise posteriors, a combined predictive‑ELBO training objective, and practical mechanisms for minibatching and prior regularisation, the BNNP advances the state of the art in Bayesian deep learning and opens avenues for richer meta‑learning applications.

