Multivariate Bayesian Last Layer for Regression with Uncertainty Quantification and Decomposition
We present new Bayesian Last Layer neural network models in the setting of multivariate regression under heteroscedastic noise, and propose EM algorithms for parameter learning. Bayesian modeling of a neural network’s final layer has the attractive property of uncertainty quantification with a single forward pass. The proposed framework is capable of disentangling the aleatoric and epistemic uncertainty, and can be used to enhance a canonically trained deep neural network with uncertainty-aware capabilities.
💡 Research Summary
The paper introduces a novel Bayesian Last Layer (BLL) framework for multivariate regression under heteroscedastic noise, extending classical Bayesian linear regression (BLR) to modern deep learning settings. The authors model the output as y = A ϕ(x) + σ(x) ε, where ϕ(x) is a deep neural network feature extractor, σ(x) is an input‑dependent noise scaling function, and ε ∼ 𝒩(0, V) is a Gaussian white noise vector with covariance V. By placing a matrix‑Normal prior A ∼ ℳ𝒩(M, V, K) on the final linear weights (with row covariance V, shared with the noise, and column covariance K), they retain analytical tractability: the joint, marginal (evidence), posterior, and predictive distributions all have closed‑form expressions. This enables single‑pass predictions that simultaneously provide a mean estimate and a full covariance matrix, allowing a clean separation of aleatoric uncertainty (captured by σ²(x) V) and epistemic uncertainty (captured by the posterior variance of A).
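The closed‑form posterior and the uncertainty split can be sketched in a few lines of NumPy. The code below is an illustrative reconstruction from the model as stated, not the authors' code: the shapes, function names, and weighted sufficient statistics are our conventions, and fixed features ϕ and known hyper‑parameters M, K, V are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def bll_posterior(Phi, Y, sigma2, M, K):
    """Matrix-Normal posterior of the last-layer weights A.

    Phi: (n, d) features, Y: (n, m) targets, sigma2: (n,) noise scales,
    M: (m, d) prior mean, K: (d, d) prior column covariance.
    Returns the posterior mean of A and the posterior column covariance.
    """
    w = 1.0 / sigma2                        # per-sample precisions
    K_inv = np.linalg.inv(K)
    S_xx = K_inv + (Phi.T * w) @ Phi        # precision-weighted scatter
    S_yx = M @ K_inv + (Y.T * w) @ Phi
    S_xx_inv = np.linalg.inv(S_xx)
    return S_yx @ S_xx_inv, S_xx_inv

def bll_predict(phi, sigma2_star, A_post, S_xx_inv, V):
    """Single-pass predictive mean plus aleatoric/epistemic covariances."""
    mean = A_post @ phi
    aleatoric = sigma2_star * V             # sigma^2(x) V: irreducible noise
    epistemic = (phi @ S_xx_inv @ phi) * V  # variance from uncertainty in A
    return mean, aleatoric, epistemic

# toy demo: recover A from noisy data, then split the predictive covariance
n, d, m = 200, 3, 2
Phi = rng.standard_normal((n, d))
A_true = rng.standard_normal((m, d))
Y = Phi @ A_true.T + np.sqrt(0.1) * rng.standard_normal((n, m))
A_post, S_xx_inv = bll_posterior(Phi, Y, np.full(n, 0.1),
                                 np.zeros((m, d)), np.eye(d))
mean, alea, epi = bll_predict(rng.standard_normal(d), 0.1,
                              A_post, S_xx_inv, np.eye(m))
```

The predictive covariance decomposes as σ²(x) V + (φᵀ S_xx⁻¹ φ) V; the first term is aleatoric, the second epistemic and shrinks as more data accumulate in S_xx.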
A key theoretical contribution is the analysis of evidence maximization. The authors prove that naïvely maximizing the marginal likelihood with respect to all BLL parameters jointly admits a degenerate maximum‑likelihood solution, which can cause over‑fitting. To avoid this, they propose two stabilization strategies: fixing the prior mean M (or, equivalently, the posterior mean) and placing a hyper‑prior on the column covariance K (e.g., an inverse‑Wishart). They also discuss an evidential‑learning view in which the hyper‑parameters are estimated by type‑II maximum likelihood, and they show how this connects to variational Bayesian last layer (VBLL) methods.
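To make the degeneracy concrete, note that under the model above the evidence of a single observation has a simple closed form (a reconstruction from the stated model, not the paper's exact expressions):

```latex
% Evidence of one observation under y = A\phi(x) + \sigma(x)\,\varepsilon,
% with A \sim \mathcal{MN}(M, V, K) and \varepsilon \sim \mathcal{N}(0, V):
y \mid x \;\sim\; \mathcal{N}\!\bigl(M\phi(x),\; s(x)\,V\bigr),
\qquad s(x) \;=\; \sigma^2(x) + \phi(x)^{\top} K\,\phi(x).
```

If M is flexible enough to interpolate the targets, the quadratic data‑fit term can be driven to zero, and shrinking s(x) (through σ and K) then sends the log‑evidence to +∞, a degenerate maximizer. Fixing M or regularizing K with a hyper‑prior rules out exactly this collapse.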
Training is performed with an Expectation‑Maximization (EM) algorithm that is compatible with mini‑batch stochastic gradient descent. In the E‑step, given current network parameters and hyper‑parameters, the posterior mean S_yx S_xx⁻¹ and covariance S_xx⁻¹ are computed analytically. In the M‑step, the deep feature extractor ϕ and the noise scaling σ are updated using standard back‑propagation on a loss derived from the expected complete‑data log‑likelihood, while the hyper‑parameters (M, K, V) are updated via closed‑form formulas. This decouples deterministic network training from Bayesian hyper‑parameter inference, making the approach suitable for transfer learning: a pre‑trained backbone can be frozen, and only the last layer and its Bayesian parameters need to be re‑estimated for a new task.
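With a frozen backbone, the E‑ and M‑steps reduce to a short alternation. The sketch below is again illustrative rather than the authors' implementation: it keeps ϕ, σ, M, and K fixed, updates only V in closed form from the expected complete‑data log‑likelihood, and drops the prior's contribution to the V update for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def em_step(Phi, Y, sigma2, M, K, V):
    """One EM iteration for the last layer with frozen features.

    E-step: matrix-Normal posterior of A (mean S_yx S_xx^{-1},
    column covariance S_xx^{-1}).
    M-step: closed-form update of the noise covariance V using the
    expected residual outer products under the posterior of A.
    """
    n = len(Y)
    w = 1.0 / sigma2
    K_inv = np.linalg.inv(K)
    S_xx = K_inv + (Phi.T * w) @ Phi
    S_xx_inv = np.linalg.inv(S_xx)
    A_post = (M @ K_inv + (Y.T * w) @ Phi) @ S_xx_inv    # E-step mean
    R = Y - Phi @ A_post.T                               # residuals
    q = np.einsum('nd,de,ne->n', Phi, S_xx_inv, Phi)     # phi^T S_xx^-1 phi
    V_new = ((R.T * w) @ R + (w * q).sum() * V) / n      # M-step for V
    return A_post, V_new

# toy run: repeated EM steps drive V toward the true noise covariance
n, d, m = 500, 4, 2
Phi = rng.standard_normal((n, d))
A_true = rng.standard_normal((m, d))
V_true = np.diag([0.5, 2.0])
Y = Phi @ A_true.T + rng.standard_normal((n, m)) @ np.sqrt(V_true)
sigma2 = np.ones(n)          # homoscedastic for this toy example
M, K, V = np.zeros((m, d)), np.eye(d), np.eye(m)
for _ in range(20):
    A_post, V = em_step(Phi, Y, sigma2, M, K, V)
```

In the full algorithm, the back‑propagation update of ϕ and σ would replace the frozen features between E‑steps; the structure of the alternation is unchanged.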
The authors extend the framework to the case where the noise covariance V is unknown by employing a matrix‑T distribution, which treats V as a random variable and yields heavier‑tailed predictive distributions. This generalization provides more conservative uncertainty estimates when data are scarce.
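To see what the heavier tails buy, note that a one‑dimensional marginal of a matrix‑T predictive is a Student‑t. The toy comparison below (generic, not tied to the paper's parameterization) shows how much more mass the t places far from the mean when the effective degrees of freedom are small, i.e., when data are scarce:

```python
import math

def normal_pdf(x):
    """Standard normal density (the large-data Gaussian limit)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def student_t_pdf(x, nu):
    """Standard Student-t density with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1.0 + x * x / nu) ** (-(nu + 1) / 2)

# Four standard deviations out, a t with few degrees of freedom (scarce
# data) keeps far more probability mass than the Gaussian:
ratio = student_t_pdf(4.0, nu=3) / normal_pdf(4.0)
```

With ν = 3 this ratio is well over an order of magnitude, which is precisely the conservative behavior one wants from the predictive when V must be inferred from few samples.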
The empirical evaluation comprises three parts:
- Synthetic heteroscedastic regression demonstrates that the model accurately recovers the input‑dependent noise scale and provides well‑calibrated predictive intervals.
- Transfer‑learning experiments in which a ResNet‑50 pretrained on ImageNet is equipped with a BLL head for downstream regression tasks (e.g., age estimation) show that the BLL‑enhanced model matches the baseline in RMSE while achieving substantially lower negative log‑likelihood (NLL) and better calibration.
- Benchmarks on several multivariate UCI regression datasets (Energy, Kin8nm, Naval, etc.) show that BLL outperforms Deep Evidential Regression, Monte‑Carlo Dropout, and variational Bayesian neural networks in both predictive accuracy and uncertainty quantification.
Overall, the paper makes several significant contributions: (i) a mathematically rigorous extension of Bayesian linear regression to deep, multivariate, heteroscedastic settings; (ii) a stable EM‑based learning algorithm that integrates seamlessly with modern deep learning pipelines; (iii) a clear decomposition of aleatoric and epistemic uncertainties with closed‑form predictive distributions; and (iv) practical demonstrations of the method’s utility in transfer learning and real‑world regression benchmarks. Future directions suggested include Bayesian treatment of deeper layers, sparsity‑inducing priors, and scaling to multimodal large‑scale data.