Rethinking Approximate Gaussian Inference in Classification
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed. We develop a common formalism to describe such methods, which we view as outputting Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax. However, such softmax Gaussian integrals cannot be solved analytically, and Monte Carlo (MC) approximations can be costly and noisy. We propose to replace the softmax activation by element-wise normCDF or sigmoid, which allows for the accurate sampling-free approximation of predictives. This also enables the approximation of the Gaussian pushforwards by Dirichlet distributions with moment matching. This approach entirely eliminates the runtime and memory overhead associated with MC sampling. We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at https://github.com/bmucsanyi/probit.


💡 Research Summary

Classification models typically use the softmax activation to map logits to a probability simplex. When a Gaussian distribution over logits is assumed—as in many recent uncertainty‑aware methods (Laplace, heteroscedastic classifiers, SNGP)—the final predictive distribution becomes the push‑forward of a Gaussian through the softmax. This integral has no closed form, forcing practitioners to rely on Monte‑Carlo (MC) sampling. MC scales linearly with the number of classes, incurs substantial memory and runtime overhead, and introduces stochastic noise that degrades the quality of epistemic uncertainty estimates, especially for large‑scale problems such as ImageNet.
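The MC baseline described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the function name `mc_softmax_predictive` and the diagonal-covariance assumption are ours.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_softmax_predictive(mu, sigma, n_samples=1000, rng=None):
    """MC estimate of E[softmax(z)] for z ~ N(mu, diag(sigma^2)).

    Cost and memory grow with n_samples * num_classes, and the
    estimate is noisy -- the overhead the paper seeks to remove.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    z = mu + sigma * rng.standard_normal((n_samples, mu.shape[-1]))
    return softmax(z).mean(axis=0)

mu = np.array([2.0, 0.5, -1.0])     # logit means
sigma = np.array([1.0, 0.5, 2.0])   # logit standard deviations
p = mc_softmax_predictive(mu, sigma)
```

Note that the estimate changes from run to run unless the random seed is fixed, which is exactly the stochastic noise the summary refers to.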

The authors propose a radical yet simple remedy: replace the softmax with an element‑wise normal CDF (Φ) or logistic sigmoid (ρ) followed by a normalisation step n(q)=q/∑q. Both Φ and ρ admit analytically tractable one‑dimensional Gaussian integrals:

  • ∫ Φ(y) N(y; μ, σ²) dy = Φ( μ / √(1+σ²) )
  • ∫ ρ(y) N(y; μ, σ²) dy ≈ ρ( μ / √(1+πσ²/8) ) (the classic probit approximation).

Consequently, for each class c we can compute a closed‑form “softened” expectation q̃_c = Φ(μ_c/√(1+σ_c²)) or ρ(μ_c/√(1+πσ_c²/8)). The final predictive probabilities are obtained by normalising these expectations: p̂_c = q̃_c / ∑_k q̃_k.
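Putting the per-class expectations and the normalisation together gives the full sampling-free predictive. A minimal sketch using the Φ variant, assuming diagonal logit covariance (the function name `probit_predictive` is ours):

```python
import numpy as np
from scipy.stats import norm

def probit_predictive(mu, var):
    """Sampling-free predictive: per-class probit expectation, then normalise.

    q_c = Phi(mu_c / sqrt(1 + var_c)),  p_c = q_c / sum_k q_k
    """
    q = norm.cdf(mu / np.sqrt(1.0 + var))
    return q / q.sum(axis=-1, keepdims=True)

mu = np.array([2.0, 0.5, -1.0])   # logit means
var = np.array([1.0, 0.25, 4.0])  # logit variances
p = probit_predictive(mu, var)
```

Unlike the MC estimate, this is deterministic and costs a single vectorised pass over the class dimension, with no per-sample memory.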

This three‑step recipe (obtain Gaussian means and covariances from any approximate inference method, choose Φ or ρ as the element‑wise activation, and apply the closed‑form formulas at inference) eliminates any need for sampling.

The authors analyse the approximation from two angles. Empirically, on a synthetic dataset where logits are drawn uniformly, the KL divergence between the "true" predictive (approximated with 10,000 MC samples) and the proposed Φ/ρ approximations is consistently lower than that of softmax‑MC, mean‑field, and Laplace‑bridge methods. Theoretically, Theorem 3.1 shows that the KL error is bounded by a constant M(K) that depends only on a compact set K containing all (μ, σ²) pairs, plus an O(Var
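The empirical comparison described above can be reproduced in miniature: draw random logit means and variances, form a dense-MC reference predictive, and measure the KL divergence to the sampling-free probit approximation. This is an illustrative sketch, not the paper's benchmark; the sample counts and parameter ranges are our own choices.

```python
import numpy as np
from scipy.stats import norm

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
mu = rng.uniform(-3.0, 3.0, size=5)     # random logit means
sigma = rng.uniform(0.5, 2.0, size=5)   # random logit std devs

# Dense-MC reference for E[softmax(z)], z ~ N(mu, diag(sigma^2))
z = mu + sigma * rng.standard_normal((100_000, 5))
p_true = softmax(z).mean(axis=0)

# Sampling-free probit approximation
q = norm.cdf(mu / np.sqrt(1.0 + sigma**2))
p_probit = q / q.sum()

# KL(p_true || p_probit), the quantity compared in the paper's synthetic study
kl = float(np.sum(p_true * np.log(p_true / p_probit)))
```

Repeating this over many draws of (μ, σ²) gives a distribution of KL errors that can be compared against few-sample softmax-MC estimates.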

