Practical Deep Heteroskedastic Regression


Uncertainty quantification (UQ) in deep learning regression is of wide interest, as it supports critical applications including sequential decision making and risk-sensitive tasks. In heteroskedastic regression, where the uncertainty of the target depends on the input, a common approach is to train a neural network that parameterizes the mean and the variance of the predictive distribution. However, training deep heteroskedastic regression models poses practical challenges, including a trade-off between uncertainty quantification and mean prediction, optimization difficulties, representation collapse, and variance overfitting. In this work we identify previously undiscussed fallacies and propose a simple, efficient procedure that addresses these challenges jointly: post-hoc fitting a variance model on the intermediate layers of a pretrained network using a hold-out dataset. We demonstrate that our method achieves on-par or state-of-the-art uncertainty quantification on several molecular graph datasets without compromising mean prediction accuracy, while remaining cheap to use at prediction time.


💡 Research Summary

The paper tackles the practical challenges of deep heteroskedastic regression, where a neural network must predict both a mean and an input‑dependent variance for regression tasks. While this formulation is attractive for providing total predictive uncertainty, training such mean‑variance networks end‑to‑end with a Gaussian negative log‑likelihood (NLL) loss suffers from four well‑documented problems: (1) optimization difficulties caused by the NLL gradient structure, which can stall learning when the variance grows large; (2) last‑layer representation collapse, where the backbone learns features only useful for the mean, discarding directions needed for variance estimation; (3) residual variance over‑fitting, a manifestation of double‑descent where residuals computed on the training set do not generalize; and (4) practicality concerns for large‑scale models, including the need to preserve mean accuracy, avoid extra hyper‑parameter tuning, keep inference cheap, and maintain a low implementation barrier.
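The gradient pathology behind problem (1) can be read off the Gaussian NLL itself: the gradient with respect to the mean is scaled by 1/σ², so samples assigned a large predicted variance receive vanishing mean updates. A small numpy illustration (not code from the paper):

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """Per-sample Gaussian negative log-likelihood, up to an additive constant."""
    return 0.5 * (np.log(var) + (y - mu) ** 2 / var)

def grad_mu(y, mu, var):
    """Gradient of the NLL w.r.t. the mean: (mu - y) / var."""
    return (mu - y) / var

# The same residual produces a 100x weaker mean gradient once the
# predicted variance is large, which can stall learning of the mean.
print(grad_mu(y=0.0, mu=2.0, var=1.0))    # 2.0
print(grad_mu(y=0.0, mu=2.0, var=100.0))  # 0.02
```

This is why a model that inflates its variance early in training can effectively stop improving its mean predictions.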

The authors first provide a thorough analysis of each problem, reviewing prior attempts such as β‑NLL re‑weighting, staged training, regularization tricks, and Bayesian priors. Most of these either introduce additional hyper‑parameters, degrade mean performance, or require changes to the training pipeline.
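Of these remedies, β‑NLL (Seitzer et al.) is representative: each sample's NLL is re-weighted by a stop-gradient copy of its predicted variance raised to the power β, which interpolates between the plain NLL (β = 0) and MSE-like mean gradients (β = 1), at the cost of a new hyper‑parameter. A minimal numpy sketch of the standard formulation:

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """Per-sample Gaussian NLL up to an additive constant."""
    return 0.5 * (np.log(var) + (y - mu) ** 2 / var)

def beta_nll(y, mu, var, beta=0.5):
    """beta-NLL re-weighting: scale each sample's NLL by var**beta,
    where the weight is treated as a constant (stop-gradient).
    numpy has no autodiff, so the stop-gradient is implicit here;
    in an autodiff framework the weight would be detached."""
    weight = var ** beta  # stop-gradient in a real implementation
    return np.mean(weight * gaussian_nll(y, mu, var))
```

Setting `beta=0.0` recovers the plain averaged NLL, illustrating why β is an extra knob that must be tuned per task.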

To address all four issues simultaneously, the paper proposes a post‑hoc variance learning scheme. A deep model is first trained in the usual way to predict only the mean (the “backbone”). Then, using a held‑out calibration set, a set of linear variance heads is fitted on one or more intermediate latent representations (z_l) extracted from the frozen backbone. The variance predictor is a positive‑valued linear readout of the frozen latents, e.g. of the form

\( \hat{\sigma}^2(x) = g\!\left(w^\top z_l(x) + b\right), \)

where \(g\) is a positivity link such as softplus and \((w, b)\) are the only parameters fitted on the held‑out set.
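A minimal numpy sketch of this post‑hoc scheme, assuming a single frozen latent layer, a softplus link, and plain gradient descent on the hold‑out Gaussian NLL (the function names and single‑head setup are illustrative, not the paper's exact procedure):

```python
import numpy as np

def softplus(a):
    """Numerically stable softplus: log(1 + exp(a))."""
    return np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0.0)

def fit_variance_head(z, resid, steps=2000, lr=0.1):
    """Fit a linear variance head sigma^2(x) = softplus(z @ w + b) on
    frozen hold-out latents `z` (n x d) and mean-model residuals
    `resid = y - mu(x)`, by full-batch gradient descent on the
    Gaussian NLL. The backbone is never updated."""
    n, d = z.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        pre = z @ w + b
        var = softplus(pre) + 1e-6          # keep variance strictly positive
        dsoft = 1.0 / (1.0 + np.exp(-pre))  # d softplus / d pre = sigmoid(pre)
        # d NLL / d var = 0.5 * (1/var - resid^2 / var^2)
        dvar = 0.5 * (1.0 / var - resid ** 2 / var ** 2)
        dpre = dvar * dsoft
        w -= lr * (z.T @ dpre) / n
        b -= lr * dpre.mean()
    return w, b
```

At prediction time the head adds only one extra matrix-vector product per chosen layer, which is what keeps the method cheap; fitting several heads on different layers and selecting among them on the calibration set follows the same pattern.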

