Activation-Space Uncertainty Quantification for Pretrained Networks


Authors: Richard Bergna, Stefan Depeweg, Sergio Calvo-Ordoñez, Jonathan Plenk, Alvaro Cartea, José Miguel Hernández-Lobato

Abstract

Reliable uncertainty estimates are crucial for deploying pretrained models; yet, many strong methods for quantifying uncertainty require retraining, Monte Carlo sampling, or expensive second-order computations, and may alter a frozen backbone's predictions. To address this, we introduce Gaussian Process Activations (GAPA), a post-hoc method that shifts Bayesian modeling from weights to activations. GAPA replaces standard nonlinearities with Gaussian-process activations whose posterior mean exactly matches the original activation, preserving the backbone's point predictions by construction while providing closed-form epistemic variances in activation space. To scale to modern architectures, we use a sparse variational inducing-point approximation over cached training activations, combined with local k-nearest-neighbor subset conditioning, enabling deterministic single-pass uncertainty propagation without sampling, backpropagation, or second-order information. Across regression, classification, image segmentation, and language modeling, GAPA matches or outperforms strong post-hoc baselines in calibration and out-of-distribution detection while remaining efficient at test time.

1. Introduction

Reliable uncertainty quantification (UQ) is crucial in risk-sensitive deployments, yet many effective research methods remain impractical in modern settings (Abdar et al., 2021).
Weight-space Bayesian approaches (e.g., variational BNNs) often require retraining, labeled data, or multi-sample evaluation; ensembles multiply compute; and Laplace-style methods rely on curvature estimates that scale poorly as models and output spaces grow (Blundell et al., 2015; MacKay, 1992; Bergna et al., 2024; Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017; Ritter et al., 2018; Ortega et al., 2023). The gap is most pronounced for pretrained backbones, where weights are not expected to be modified and test-time budgets favor single-pass inference. In this regime, a practical post-hoc method should be single-pass, prediction-preserving, epistemic, and scalable to foundation models (Table 1).

¹Department of Engineering, University of Cambridge, Cambridge, UK. ²Siemens AG, Munich, Germany. ³Mathematical Institute and Oxford-Man Institute, University of Oxford, Oxford, UK. Correspondence to: Richard Bergna. Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s).

Table 1. The gap in uncertainty quantification methods.

Method         | Post-hoc | Single Pass | Preserves Mean | Epistemic UQ | Foundation Ready
BNNs           |    ✗     |      ✗      |       ✗        |      ✓       |        ✗
Ensembles      |    ✗     |      ✗      |       —        |      ✓       |        ✗
MC Dropout     |    ✗     |      ✗      |       ✗        |      ✓       |        ✗
Laplace        |    ✓     |      ✗      |       ✗        |      ✓       |        ✗
LL-Laplace     |    ✓     |      ✓      |       ✗        |      ✓       |        ✗
Temp. Scaling  |    ✓     |      ✓      |       ✗        |      ✗       |        ✓
GAPA (Ours)    |    ✓     |      ✓      |       ✓        |      ✓       |        ✓

We address this by shifting uncertainty modeling from weights to activations. We introduce Gaussian Process Activations (GAPA), a drop-in module that replaces deterministic activations with Gaussian-process activations whose posterior mean matches the original nonlinearity, thereby preserving the frozen backbone's point predictions by construction while producing activation-space epistemic variances (Figure 1).
For scalability, GAPA conditions on cached training activations using a sparse approximation (compression plus local kNN conditioning), and propagates the resulting uncertainty through the network via deterministic variance-propagation rules, enabling single-pass predictive uncertainty. Our contributions are:

1. Mean-preserving post-hoc UQ: GAPA provides epistemic uncertainty for pretrained networks while preserving point predictions.
2. Scalable conditioning: we combine inducing points and local k-NN conditioning for practical inference at modern scales.
3. Deterministic propagation: we derive single-pass variance propagation from activation space to output space.
4. Empirical validation: across regression, classification, segmentation, and language modelling, GAPA improves calibration and OOD detection with fast inference time.

Figure 1. Comparison of uncertainty quantification methods on a toy binary classification task. Left to right: MAP (deterministic backbone), MC Dropout, Last-Layer Laplace, and GAPA (ours). Background shading indicates predictive confidence (darker = more confident); orange/yellow points show the two classes. Key observation: GAPA preserves the backbone's decision boundary (black line) exactly while adding epistemic uncertainty that grows smoothly away from training data.

Figure 2. GAPA overview. Top: GAPA leaves the network's point predictions unchanged (mean-preserving activations) while propagating an additional epistemic variance signal to the output. Bottom left: deterministic tanh activation; orange points denote cached training activations. Bottom right: GAPA-tanh, whose posterior mean matches tanh exactly; the shaded region shows ±2 standard deviations.
2. Model Proposition

At a high level, GAPA augments a frozen neural network with activation-space uncertainty while strictly preserving its original deterministic predictions. Figure 2 provides a structural overview. In the following sections, we formalize our uncertainty perspective (Sec. 2.1), define the GP activation layer (Sec. 2.2), introduce a scalable inference mechanism (Sec. 2.3), and derive rules for single-pass variance propagation through deep architectures (Sec. 2.4).

Method pipeline. GAPA operates in two phases. (i) Offline collection: we run a single forward pass of a reference training set through the pretrained backbone and cache pre-activations at selected layers. Optionally, we compress the cache into a smaller inducing set via k-means, yielding inducing inputs that admit a variational inducing-point interpretation in the sense of Titsias (2009). (ii) Test-time inference: we replace deterministic activations with Gaussian-process (GP) activations that return activation-space epistemic variances. These variances are then propagated forward through the remaining frozen network using closed-form variance-propagation rules, enabling deterministic single-pass predictive uncertainty without sampling, backpropagation, or retraining, while preserving the backbone's point predictions.

2.1. Uncertainty Modeling Perspective

We position GAPA by stating what is random in the predictive distribution. Let D = {(x_n, y_n)}_{n=1}^N denote the training data, and let x be a test input with corresponding output y. We denote by f(x) the latent predictor (e.g., a neural network or GP) evaluated at x. Predictive uncertainty is obtained by marginalizing the latent predictor:

    p(y | x, D) = ∫ p(y | f(x)) p(f(x) | x, D) df(x).    (1)

Weight-space uncertainty.
Weight-space methods (BNNs, Laplace) parameterize f(x) = f(x; w) and infer a posterior w ∼ p(w | D), inducing epistemic uncertainty via variability across plausible weights.

Activation-space uncertainty (GAPA). GAPA keeps the frozen weights deterministic and instead places epistemic uncertainty on the hidden-layer activations (e.g., ReLU outputs). For a chosen layer ℓ, write f(x) = h_ℓ(ϕ(z_ℓ(x))), where z_ℓ(x) are the layer-ℓ pre-activations, ϕ is an element-wise nonlinearity, and h_ℓ(·) denotes the remaining frozen network. We replace ϕ with a GP activation g_ℓ such that its posterior mean matches the original activation, µ_ℓ(z) = ϕ(z) (Sec. 2.2), thereby preserving the backbone point prediction exactly. Let a_ℓ := g_ℓ(z_ℓ(x)) be the resulting random activation. Then

    p(y | x, D) = ∫ p(y | h_ℓ(a_ℓ)) p(a_ℓ | z_ℓ(x), D) da_ℓ.    (2)

Uncertainty grows as test-time pre-activations move away from regions supported by the training data.

2.2. Gaussian Process Activation Function

We define the core GAPA module: a drop-in replacement for a deterministic activation that (i) preserves the frozen backbone's point predictions exactly and (ii) returns a distance-aware epistemic variance in activation space.

Setup. Consider a frozen network and layer ℓ of width d_ℓ. Let

    z_ℓ = W_ℓ h_{ℓ−1} + b_ℓ ∈ R^{d_ℓ},    h_ℓ = ϕ(z_ℓ),

where ϕ(z) = (ϕ(z_1), ..., ϕ(z_{d_ℓ}))^⊤ is an element-wise nonlinearity (e.g., ReLU).

GP activation. GAPA replaces the deterministic activation ϕ(·) with a vector-valued Gaussian process (GP) g_ℓ(·): R^{d_ℓ} → R^{d_ℓ},

    g_ℓ(·) ∼ GP(m_ℓ(·), K_ℓ(·, ·)).

For any input z, this GP induces a Gaussian marginal distribution over the activation vector a_ℓ := g_ℓ(z), with posterior mean µ_ℓ(z) and covariance K_ℓ(z, z).
For scalability, we use a diagonal output kernel,

    K_ℓ(z, z′) = diag(k_{ℓ,1}(z, z′), ..., k_{ℓ,d_ℓ}(z, z′)),

which is equivalent to modeling each activation dimension as an independent scalar GP. Importantly, GAPA never samples from these distributions: all uncertainty is propagated analytically via closed-form moment propagation.

Data collation for the GP. We run a single forward pass of the backbone's training data and cache the resulting pre-activations

    Z̃_ℓ = {z̃_ℓ^(m)}_{m=1}^M,    z̃_ℓ^(m) ∈ R^{d_ℓ}.

At each cached input we form noiseless pseudo-observations by evaluating the original activation,

    ỹ_ℓ^(m) = ϕ(z̃_ℓ^(m)) ∈ R^{d_ℓ},    m = 1, ..., M,

and collect them as Ỹ_ℓ ∈ R^{M×d_ℓ}. We condition the GP on the dataset D_ℓ = {(z̃_ℓ^(m), ỹ_ℓ^(m))}_{m=1}^M, using a small jitter/noise term σ²_n for numerical stability.

Mean preservation. We choose the GP prior mean to match the original activation, m_ℓ(z) = ϕ(z). Under standard GP regression, the posterior mean at z*_ℓ can be written as

    µ_ℓ(z*_ℓ) = m_ℓ(z*_ℓ) + A_ℓ(z*_ℓ) (Ỹ_ℓ − m_ℓ(Z̃_ℓ)),
    A_ℓ(z*_ℓ) = K_ℓ(z*_ℓ, Z̃_ℓ) (K_ℓ(Z̃_ℓ, Z̃_ℓ) + σ²_n I)^{−1}.

Since Ỹ_ℓ = ϕ(Z̃_ℓ) = m_ℓ(Z̃_ℓ) by construction, the residual term is identically zero and hence

    µ_ℓ(z*_ℓ) = m_ℓ(z*_ℓ) = ϕ(z*_ℓ)    for all z*_ℓ.

Therefore, the GAPA activation has posterior mean equal to the original activation function. Substituting h_ℓ = g_ℓ(z_ℓ) into a frozen network preserves the backbone's point predictions exactly. The remaining posterior covariance of the GP activation, which quantifies epistemic uncertainty in activation space, is specified next.

Posterior covariance. While the posterior mean is unchanged, the posterior covariance is non-zero and diagonal.
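The mean-preservation argument can be checked numerically. The sketch below conditions a scalar GP with prior mean tanh on noiseless pseudo-observations of tanh over cached pre-activations; the RBF hyperparameters and jitter value are illustrative placeholders, not the paper's choices (those are set as described in Appendix G.1).

```python
import numpy as np

def rbf(a, b, c2=1.0, ell=1.0):
    # k(a, b) = c2 * exp(-(a - b)^2 / (2 ell^2)) for 1-D inputs
    d2 = (a[:, None] - b[None, :]) ** 2
    return c2 * np.exp(-d2 / (2.0 * ell**2))

def gapa_tanh(z_star, z_cache, c2=1.0, ell=1.0, jitter=1e-4):
    # GP activation with prior mean tanh, conditioned on noiseless
    # pseudo-observations (z_cache, tanh(z_cache)).
    y = np.tanh(z_cache)                        # pseudo-targets
    K = rbf(z_cache, z_cache, c2, ell) + jitter * np.eye(len(z_cache))
    k_star = rbf(z_star, z_cache, c2, ell)      # (n_*, n)
    A = np.linalg.solve(K, k_star.T).T          # k_*^T (K + jitter I)^{-1}
    resid = y - np.tanh(z_cache)                # identically zero by construction
    mean = np.tanh(z_star) + A @ resid          # equals tanh(z_star) exactly
    var = rbf(z_star, z_star, c2, ell).diagonal() - np.einsum("ij,ij->i", A, k_star)
    return mean, np.maximum(var, 0.0)

z_cache = np.random.default_rng(0).normal(size=200)  # cached pre-activations
z_star = np.array([0.0, 6.0])                        # in- vs. far out-of-support
mean, var = gapa_tanh(z_star, z_cache)
```

Regardless of the kernel or its hyperparameters, `mean` equals `tanh(z_star)` because the residual vector is zero, while `var` is near zero inside the cached support and grows toward the prior variance far from it.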
For each neuron i,

    [Σ_ℓ(z*_ℓ)]_{ii} = k_{ℓ,i}(z*_ℓ, z*_ℓ) − k_{ℓ,i}(z*_ℓ)^⊤ (K_{ℓ,i} + σ²_n I)^{−1} k_{ℓ,i}(z*_ℓ),

where [K_{ℓ,i}]_{mn} = k_{ℓ,i}(z̃_ℓ^(m), z̃_ℓ^(n)) and [k_{ℓ,i}(z*_ℓ)]_m = k_{ℓ,i}(z*_ℓ, z̃_ℓ^(m)). Collecting all neuron-wise variances yields the layer covariance

    Σ_ℓ(z*_ℓ) = diag([Σ_ℓ(z*_ℓ)]_{11}, ..., [Σ_ℓ(z*_ℓ)]_{d_ℓ d_ℓ}).    (3)

The resulting layer covariance satisfies Σ_ℓ(z*_ℓ) ∈ R^{d_ℓ×d_ℓ} and captures neuron-wise uncertainty that increases as z*_ℓ departs from the cached pre-activations.

Why diagonal covariance? We model neurons as conditionally independent (i.e., a diagonal output covariance) for tractability. A full multi-output covariance would require storing and propagating dense d_ℓ × d_ℓ matrices, which is prohibitive in memory and compute for modern wide networks. The diagonal approximation is standard in scalable Bayesian models and is sufficient in our setting to capture activation-level epistemic uncertainty, as evidenced by our strong empirical results (Sec. 3).

2.3. Local Inducing-Point Approximation

At layer ℓ, GAPA conditions d_ℓ independent scalar GPs on a cache of N_ℓ pre-activations Z̃_ℓ ∈ R^{N_ℓ×d_ℓ} obtained from a single offline forward pass of the backbone's training data (or a subset thereof) through the frozen network. Even with a diagonal output kernel, exact GP conditioning in Eq. (3) requires solving an N_ℓ × N_ℓ linear system per neuron, yielding O(d_ℓ N³_ℓ) time and O(d_ℓ N²_ℓ) memory, which is prohibitive for modern networks. We therefore use a two-stage approximation: (i) an offline global inducing set that compresses the cache, and (ii) test-time local conditioning on the K nearest inducing points to each query pre-activation.

Stage 1: Inducing-point construction (offline).
We construct inducing inputs Z_ℓ ∈ R^{M_ℓ×d_ℓ} with M_ℓ ≪ N_ℓ by running k-means on the cached pre-activations Z̃_ℓ and taking the M_ℓ centroids. These inducing points provide a compressed representation of the training-data activation cache used for scalable GP inference. A formal connection between this procedure and variational inducing-point GPs is provided in Appendix B, and Appendix M.2 studies sensitivity to the number of inducing points.

Stage 2: Local K-nearest-neighbour conditioning (test time). At test time, using all M_ℓ inducing points would require an M_ℓ × M_ℓ solve per query, which is still expensive. We therefore adopt a local GP approximation (Gramacy and Apley, 2015): for each query pre-activation z*_ℓ ∈ R^{d_ℓ}, we form a small local subset of inducing inputs by K-nearest neighbours in activation space. Concretely, let N_K(z*_ℓ) ⊆ {1, ..., M_ℓ} denote the indices of the K nearest inducing points to z*_ℓ under Euclidean distance, and define

    Z_{ℓ,K}(z*_ℓ) = Z_ℓ^{N_K(z*_ℓ)} ∈ R^{K×d_ℓ},

where Z_ℓ^{N_K(·)} denotes the submatrix formed by selecting the corresponding rows of Z_ℓ. Neighbour retrieval is performed using FAISS (Douze et al., 2025) with approximate nearest-neighbour search, yielding sublinear query time in practice. Appendix M.5 further studies sensitivity to K.

Approximate posterior covariance. Given the local inducing inputs Z_{ℓ,K}(z*_ℓ) ∈ R^{K×d_ℓ}, we approximate the posterior variance of neuron i ∈ {1, ..., d_ℓ} by applying the standard GP conditional-variance formula restricted to this local subset:

    σ²_{ℓ,i}(z*_ℓ) ≈ k_{ℓ,i}(z*_ℓ, z*_ℓ) − k_{ℓ,i}(z*_ℓ)^⊤ (K_{ℓ,i} + σ²_n I)^{−1} k_{ℓ,i}(z*_ℓ),    (4)

where σ²_n is a small jitter term for numerical stability. Let {z_{ℓ,K}^(m)}_{m=1}^K denote the rows of Z_{ℓ,K}(z*_ℓ).
Then the K × K kernel matrix and K-vector are defined as

    [K_{ℓ,i}]_{mn} = k_{ℓ,i}(z_{ℓ,K}^(m), z_{ℓ,K}^(n)),    [k_{ℓ,i}(z*_ℓ)]_m = k_{ℓ,i}(z*_ℓ, z_{ℓ,K}^(m)).

Collecting neuron-wise variances yields the diagonal layer covariance

    Σ_ℓ(z*_ℓ) = diag(σ²_{ℓ,1}(z*_ℓ), ..., σ²_{ℓ,d_ℓ}(z*_ℓ)) ∈ R^{d_ℓ×d_ℓ}.

Table 2. Computational complexity per layer ℓ. Exact GP refers to conditioning on all N_ℓ cached activations. I_kmeans denotes the number of k-means iterations.

Operation             | Exact GP  | GAPA (Ours)
Preprocessing         | O(N³_ℓ)   | O(N_ℓ I_kmeans) + O(M_ℓ log M_ℓ)
Inference (per query) | O(N²_ℓ)   | O(log M_ℓ) + O(K³)
Memory                | O(N²_ℓ)   | O(M_ℓ)

Conservative uncertainty (monotonicity). Intuitively, conditioning a GP on fewer points cannot reduce posterior uncertainty. Fix kernel hyperparameters and observation noise σ²_n > 0, and let A ⊆ C be two conditioning sets. Then for any test input z*,

    Var[f(z*) | D_A] ≥ Var[f(z*) | D_C].

Consequently, conditioning on the local subset Z_{ℓ,K}(z*_ℓ) ⊆ Z_ℓ cannot underestimate epistemic uncertainty relative to using the full inducing set. A formal statement and proof are provided in Appendix A; we additionally ablate sensitivity to K in Appendix M.5.

Computational complexity. Offline preprocessing comprises caching Z̃_ℓ (one forward pass), constructing Z_ℓ via k-means, and building a FAISS index over Z_ℓ, where typically K ≪ M_ℓ ≪ N_ℓ. At test time, each query requires (i) neighbour search in O(log M_ℓ) and (ii) solving a K × K system in Eq. (4). With fixed K (we use K = 50), the per-query linear algebra is constant-size and the dominant dependence on the inducing-set size is O(log M_ℓ); memory is linear in M_ℓ. See Table 2 for a summary.

2.4. Variance Propagation Through the Network

Gaussian flow intuition. Once we replace deterministic activations with GAPA modules, the forward pass no longer carries only a point value.
Instead, at any layer we track a Gaussian summary of the hidden state: a mean vector and a diagonal covariance matrix Σ_h. Concretely, each GAPA layer maps a (possibly uncertain) pre-activation input to a Gaussian output, so uncertainty can be propagated forward through the remaining frozen layers. Specialized rules for architectures such as self-attention and RMSNorm are provided in Appendix K.

Notation. We denote by µ_h the mean of a vector-valued random variable h and by Σ_h its covariance matrix. We write the variance vector v_h := diag(Σ_h), whose entries are [v_h]_i = Var(h_i). For vectors a, b of the same size, a ⊙ b denotes the Hadamard (element-wise) product, and a^{⊙2} := a ⊙ a.

(i) Linear layers. Consider a linear transformation z = Wh + b where h has diagonal covariance. Under the diagonal approximation, the output variance remains diagonal and the variance vector propagates as

    v_z = (W ⊙ W) v_h.    (5)

Intuitively, each output coordinate z_i = Σ_j W_ij h_j is a weighted sum of independent components, so its variance is the sum of squared weights times input variances.

(ii) Element-wise nonlinearities (delta method). We use first-order moment propagation (the delta method): means follow the deterministic forward pass, while variances are updated by local linearization. Let y = g(z) where z ∼ N(µ, σ²) and g is a scalar nonlinearity (e.g., ReLU, tanh). We linearize around µ: g(z) ≈ g(µ) + g′(µ)(z − µ). This gives the standard delta-method moments:

    E[y] ≈ g(µ),    Var(y) ≈ (g′(µ))² σ².    (6)

Applied element-wise to vectors y = g(z) with diagonal variances, this yields

    µ_y = g(µ_z),    v_y ≈ (g′(µ_z))^{⊙2} ⊙ v_z.    (7)

Intuitively, a nonlinearity either amplifies or attenuates uncertainty depending on its local slope.

(iii) Stacking GAPA layers (noisy-input correction).
When we place multiple GAPA layers in a network, the input to a downstream GAPA layer is no longer a point z_ℓ but a distribution (summarized as a Gaussian with mean µ_z and variance vector v_z). Standard GP conditioning assumes a deterministic test input; here the test-time input is itself uncertain. Intuitively, even if the GP were evaluated at the same mean location, uncertainty in the input location induces additional variability in the output whenever the GP mean changes with z. Following the noisy-input GP (NIGP) approximation (McHutchon and Rasmussen, 2011), we capture this effect by adding a first-order correction term. Concretely, for neuron i at layer ℓ, we evaluate the local inducing-point posterior variance at the mean input and add the NIGP correction:

    σ²_{ℓ,i}(z_ℓ) ≈ σ²_{epi,ℓ,i}(µ_z) + λ_{ℓ,i}(µ_z) + σ²_{y,i},

where the three terms are the epistemic, input-uncertainty, and aleatoric contributions, respectively. Here σ²_{epi,ℓ,i}(µ_z) is the local GP variance from Eq. (4) computed using the K-NN subset selected at µ_z, and

    λ_{ℓ,i}(µ_z) = (∇_{z_ℓ} µ_{ℓ,i}(µ_z))^⊤ diag(v_z) (∇_{z_ℓ} µ_{ℓ,i}(µ_z))

is the NIGP correction induced by input uncertainty. In our mean-preserving element-wise setting, µ_{ℓ,i}(z) = ϕ_ℓ(z_i), so this simplifies to λ_{ℓ,i}(µ_z) = (ϕ′_ℓ(µ_{z,i}))² v_{z,i}. We set σ²_{y,i} = 0 for classification; for regression we add a learned heteroscedastic noise head (Sec. 2.5).

2.5. Hyperparameter Strategy

A key design choice in GAPA is to fix GP hyperparameters rather than optimize them.

Empirical (post-hoc) hyperparameters. GAPA is designed as a post-hoc module for pretrained neural networks: after standard training, we attach GAPA using only one forward pass on the backbone's training inputs (or a subset thereof) to cache activations, without any additional backpropagation, fine-tuning, or labels. Accordingly, we set GP hyperparameters once from simple empirical statistics of the cached pre-activations.
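One concrete way to derive such statistics-based hyperparameters is sketched below: the output scale from the empirical variance of the activation outputs and the lengthscale from the median pairwise distance between cached pre-activations (the "median heuristic"). This is an illustrative assumption for a scalar neuron, not necessarily the construction the paper uses (that recipe is in Appendix G.1).

```python
import numpy as np

def empirical_rbf_hyperparams(z_cache, phi=np.tanh, n_sub=500, seed=0):
    """Illustrative post-hoc choice of (c^2, ell) for one neuron's RBF
    kernel from cached pre-activations; a sketch, not the paper's rule."""
    rng = np.random.default_rng(seed)
    # Lengthscale: median pairwise distance over a subsample of the cache.
    idx = rng.choice(len(z_cache), size=min(n_sub, len(z_cache)), replace=False)
    sub = z_cache[idx]
    d = np.abs(sub[:, None] - sub[None, :])
    ell = np.median(d[d > 0])
    # Output scale: empirical variance of the activation outputs phi(z).
    c2 = np.var(phi(z_cache))
    return c2, ell

z_cache = np.random.default_rng(1).normal(size=2000)  # cached pre-activations
c2, ell = empirical_rbf_hyperparams(z_cache)
```

Both quantities come from a single pass over the cache, require no labels or gradients, and are computed once offline, consistent with the post-hoc design above.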
Because hyperparameters are estimated from activation statistics rather than a task-specific objective, the method is task-agnostic: the same procedure applies across settings (e.g., regression, classification, token prediction, and segmentation). Caching is a one-off offline step, and test-time inference remains unchanged thereafter. We use an RBF kernel,

    k_{ℓ,i}(z, z′) = c²_i exp(−‖z − z′‖² / (2 ℓ²_i)),

with (c²_i, ℓ_i) set from cached activation statistics and a small jitter. Details of the empirical hyperparameter construction are given in Appendix G.1, with downstream likelihoods for classification and regression described in Appendix I and Appendix G.2.

3. Results

We evaluate GAPA as a post-hoc uncertainty module for frozen pretrained backbones across regression (Sec. 3.1), classification (Sec. 3.2), image segmentation (Appendix E), and language modeling (Appendix F). Ablations on layer placement, inducing-set size M, local subset size K, and uncertainty-score choices are provided in Appendix M.

3.1. Regression

Setup and baselines. We evaluate on three regression benchmarks using the original train/test splits: YEAR Prediction MSD, Airline (Hensman et al., 2013), and Taxi (Salimbeni and Deisenroth, 2017). We compare against: MAP (backbone model), Dropout (MC Dropout with multiple stochastic forward passes; Gal and Ghahramani, 2016), Ensemble (independently trained models; Lakshminarayanan et al., 2017), and post-hoc last-layer Bayesian baselines (LLA variants, ELLA, VaLLA; Ortega et al., 2023).

Metrics. Performance is evaluated using Negative Log-Likelihood (NLL), Continuous Ranked Probability Score (CRPS), and the Centered Quantile Metric (CQM), with definitions in Appendix J.1.

Table 3 shows a consistent pattern across all regression benchmarks.
Table 3. Results on regression datasets. Best values are in purple, and second-best in teal. An asterisk (*) indicates a last-layer LLA variant. Results are averages over 5 random seeds; standard deviations (< 10⁻³ in all cases) are omitted for brevity. The full table with stds can be found in Table 7 in the Appendix.

Model       |        Airline         |          Year          |          Taxi
            |  NLL    CRPS    CQM    |  NLL    CRPS    CQM    |  NLL    CRPS    CQM
MAP         | 5.121  18.695  0.148   | 3.673  5.023  0.134    | 3.775  3.755  0.211
LLA Diag    | 5.125  18.648  0.143   | 3.647  4.917  0.088    | 3.722  3.990  0.257
LLA KFAC    | 5.127  18.631  0.142   | 3.648  4.915  0.086    | 3.706  3.986  0.256
LLA*        | 5.127  18.631  0.141   | 3.648  4.915  0.086    | 3.726  3.985  0.256
LLA* KFAC   | 5.127  18.631  0.141   | 3.648  4.914  0.086    | 3.726  3.985  0.256
ELLA        | 5.388  21.671  0.413   | 4.020  6.049  0.424    | 3.885  3.680  0.219
VaLLA 100   | 4.963  18.814  0.099   | 3.515  5.004  0.047    | 3.235  3.999  0.149
VaLLA 200   | 4.965  18.788  0.098   | 3.485  4.970  0.041    | 3.232  3.979  0.142
Dropout     | 5.102  19.066  0.938   | 3.689  5.128  0.939    | 3.849  4.592  0.951
Ensemble    | 5.053  18.205  0.933   | 3.639  4.833  0.938    | 3.631  3.384  0.961
GAPA        | 4.946  18.068  0.103   | 3.470  4.663  0.014    | 3.112  4.035  0.104

GAPA achieves the best NLL on all three datasets, indicating the strongest overall fit of the predictive distribution. Moreover, GAPA attains the best CQM overall, demonstrating consistently well-calibrated predictive quantiles. CRPS follows the same trend on Airline and Year, with a small degradation on Taxi, where ELLA performs best.

3.2. Classification

We now evaluate classification performance in terms of accuracy and calibration, and assess OOD detection via AUROC using predictive entropy and BALD. Metric definitions are provided in Appendix J.2.

Baselines.
We compare against (i) the deterministic backbone (MAP); (ii) sampling-based baselines requiring multiple forward passes or multiple trainings (MC Dropout (Gal and Ghahramani, 2016), Deep Ensembles (Lakshminarayanan et al., 2017)); (iii) post-hoc last-layer Bayesian baselines that Bayesianize only the head (LLA / ELLA / VaLLA; Ortega et al., 2023); and (iv) distance-/feature-based alternatives including SNGP (Liu et al., 2020), MF-VI (Blundell et al., 2015), and a subset-GP baseline based on the GP interpretation of local linearization. We also report linear probing (Alain and Bengio, 2016) as a lightweight head-only baseline, and DDU (Mukhoti et al., 2023) as a post-hoc feature-density uncertainty estimator.

MNIST / FMNIST (MLP). Following Ortega et al. (2023), we train a 2-layer MLP with 200 hidden units and tanh activations on MNIST (LeCun et al., 2002) and Fashion-MNIST (Xiao et al., 2017). We evaluate OOD detection by swapping datasets: Fashion-MNIST is treated as OOD for MNIST (MNIST → FMNIST) and MNIST as OOD for Fashion-MNIST (FMNIST → MNIST), reporting AUROC based on predictive entropy and BALD-OOD (Table 4). Since GAPA is mean-preserving, it leaves the backbone's point predictions unchanged, and therefore matches the MAP classifier in accuracy by construction. GAPA achieves the best BALD AUROC in both directions and the best predictive-entropy AUROC on FMNIST → MNIST, while remaining competitive on MNIST → FMNIST. Moreover, GAPA enables deterministic single-pass inference with near-MAP runtime (2.05 s test time), yielding substantial speedups over Monte Carlo and curvature-based baselines.

Figure 3. Predictive NLL under rotation corruption for MNIST (left panel) and FMNIST (right panel); lower is better. Results are averaged over 5 random seeds.
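The entropy-based OOD scoring behind these AUROC numbers can be sketched as follows. The toy softmax outputs are illustrative, and the AUROC uses the Mann-Whitney rank formulation (without tie correction, which suffices for this sketch).

```python
import numpy as np

def predictive_entropy(probs):
    # probs: (n, classes) predictive distribution per input
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def auroc(scores_id, scores_ood):
    """AUROC for detecting OOD from an uncertainty score
    (higher score = more OOD), via the Mann-Whitney U statistic."""
    s = np.concatenate([scores_id, scores_ood])
    labels = np.concatenate([np.zeros(len(scores_id)), np.ones(len(scores_ood))])
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    n_pos, n_neg = len(scores_ood), len(scores_id)
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Toy check: confident ID softmax outputs vs. diffuse OOD outputs.
id_probs = np.tile([0.97, 0.01, 0.01, 0.01], (100, 1))
ood_probs = np.tile([0.4, 0.3, 0.2, 0.1], (100, 1))
score_id = predictive_entropy(id_probs)
score_ood = predictive_entropy(ood_probs)
# Every OOD entropy exceeds every ID entropy here, so AUROC is 1.0
```

The same scorer applies to any per-input uncertainty (predictive entropy, BALD, or GAPA's propagated variance); only the score array changes.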
Table 4. Classification performance and runtime on MNIST and Fashion-MNIST. Accuracy (ACC), negative log-likelihood (NLL), expected calibration error (ECE), OOD detection, and BALD are reported. Best results are shown in purple and second-best in teal. All results are averaged over 5 random seeds; standard deviations (< 10⁻³) are omitted. Training and test times are wall-clock seconds (K = 10³). The full table with standard deviations is provided in Table 7 in the Appendix.

MNIST:
Model           |  ACC    NLL    ECE    OOD   BALD  | Train   Test
MAP (backbone)  | 0.978  0.068  0.005  0.919  0.919 |   —      1.24
LLA Diag        | 0.976  0.177  0.105  0.932  0.941 | 2.34K    2.39
LLA KFAC        | 0.978  0.102  0.042  0.971  0.971 | 130.0    2.85
LLA*            | 0.978  0.070  0.009  0.924  0.924 |  42.0    4.6
LLA* KFAC       | 0.979  0.070  0.009  0.923  0.928 |  31.2   17.6
ELLA            | 0.978  0.068  0.005  0.919  0.912 | 821.8  148.7
VaLLA 100       | 0.978  0.068  0.005  0.919  0.934 | 2.19K   16.4
VaLLA 200       | 0.978  0.068  0.005  0.919  0.934 | 3.43K   18.3
Linear Probing  | 0.977  0.117  0.015  0.884  0.883 | 2.78K    3.6
GPP             | 0.978  1.648  0.784  0.934  0.904 | 5.79K  23.5K
Dropout         | 0.978  0.072  0.009  0.923  0.944 |   —      4.3
Ensemble        | 0.979  0.069  0.038  0.936  0.962 |   —     11.9
DDU (est.)      | 0.978  0.068  0.005  0.921  0.919 | 202.4    8.2
GAPA-Diag       | 0.978  0.073  0.016  0.963  0.976 |  91.9    2.05
GAPA-Full       | 0.978  0.072  0.013  0.969  0.983 |  96.1    8.92

FMNIST:
Model           |  ACC    NLL    ECE    OOD   BALD  | Train   Test
MAP (backbone)  | 0.859  0.392  0.007  0.846  0.821 |   —      1.20
LLA Diag        | 0.856  0.421  0.057  0.872  0.873 | 2.34K    2.22
LLA KFAC        | 0.858  0.395  0.020  0.909  0.970 | 129.9    2.84
LLA*            | 0.859  0.395  0.019  0.850  0.716 |  42.0    4.7
LLA* KFAC       | 0.859  0.394  0.017  0.849  0.717 |  31.2   17.4
ELLA            | 0.859  0.392  0.007  0.846  0.765 | 827.1  149.9
VaLLA 100       | 0.865  0.382  0.019  0.925  0.963 | 495.1   16.4
VaLLA 200       | 0.867  0.378  0.020  0.937  0.970 | 767.7   19.3
Linear Probing  | 0.858  0.395  0.048  0.785  0.776 | 2.64K    3.8
GPP             | 0.857  1.716  0.692  0.867  0.962 | 5.57K  2.27K
Dropout         | 0.858  0.393  0.009  0.850  0.911 |   —      4.3
Ensemble        | 0.859  0.373  0.041  0.863  0.938 |   —     11.9
DDU (est.)      | 0.859  0.392  0.007  0.876  0.983 | 202.3    8.2
GAPA-Diag       | 0.859  0.390  0.009  0.941  0.993 |  92.1    2.05
GAPA-Full       | 0.859  0.388  0.009  0.990  0.997 |  96.1    8.91

Robustness under distribution shift. We evaluate robustness by rotating the test images by increasing angles and plotting predictive NLL for MNIST and FMNIST, as shown in Figure 3. As rotation increases and inputs move further from the training distribution, GAPA maintains competitive NLL under shift and achieves lower NLL than most baselines for large rotations, while appropriately increasing uncertainty. This indicates robust and well-calibrated behaviour, with the model correctly identifying heavily rotated inputs as out-of-distribution.

CIFAR-10 (pretrained ResNets). We evaluate CIFAR-10 with pretrained ResNet-20/32/44/56 backbones (He et al., 2016) and use SVHN as the OOD dataset. Here the central question is the robustness–efficiency trade-off: many strong OOD baselines are prohibitively expensive at inference, while fast methods often sacrifice OOD detection. On representative ResNet-44/56 backbones (Table 5), GAPA attains strong OOD AUROC (0.931/0.953) with low test-time cost (2.85 s / 3.30 s), compared to the most accurate-but-slow baselines such as VaLLA and GP-subset (test costs in the 10²–10³ s range). Figure 4 makes this explicit: GAPA lies on, or very close to, the Pareto frontier, offering one of the best OOD–inference-cost trade-offs for all of ResNet-20/32/44/56.

Figure 4. OOD detection vs. inference cost on CIFAR-10. OOD AUROC is plotted against test-time inference cost (log scale) for ResNet backbones. Dashed lines indicate Pareto frontiers (higher OOD, lower cost). GAPA consistently lies on the frontier, achieving strong OOD performance at substantially lower inference cost than baselines.

3.3. Language models

We attach GAPA post hoc to LLaMA-3.2-3B (hidden size 3072). For each chosen transformer block (we report layer indices), we log ∼12M pre-activations on WikiText-103 (training split) at sequence length L = 96 and build a nearest-neighbor cache for uncertainty propagation. We use k-means as a preprocessing step; Appendix F provides additional ablation experiments on inducing-point selection. Uncertainty-propagation rules for key LLM components, such as self-attention and RMSNorm, are detailed in Appendix K.

As in classification, after variance propagation the output layer yields mean logits µ_{1:k} and diagonal logit variances v_{1:k} from a single forward pass. Predictive uncertainty is estimated via a vectorized reparameterization step over the top-k tokens per position (k = 512). Specifically, we draw S = 512 samples

    ℓ^(s) = µ_{1:k} + √v_{1:k} ⊙ ε^(s),    ε^(s) ∼ N(0, I),

and compute p^(s) = softmax(ℓ^(s)) to form p̄ = (1/S) Σ_{s=1}^S p^(s). This procedure introduces no additional network evaluations; sampling occurs only in logit space. Uncertainty is then decomposed using entropy as

    TU = H(p̄),    AU = (1/S) Σ_{s=1}^S H(p^(s)),    EU = TU − AU,

where H(p) = −Σ_v p_v log p_v. This corresponds to the mutual-information decomposition of uncertainty.
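The logit-space reparameterization and entropy decomposition just described can be sketched directly; the logit values and the small number of classes below are illustrative.

```python
import numpy as np

def uncertainty_decomposition(mu, v, n_samples=512, seed=0):
    """From mean logits `mu` and diagonal logit variances `v` (shape (k,)),
    compute TU = H(p_bar), AU = mean_s H(p^(s)), EU = TU - AU.
    Sampling happens only in logit space; no extra network evaluations."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    logits = mu[None, :] + np.sqrt(v)[None, :] * eps   # l^(s)
    logits -= logits.max(axis=1, keepdims=True)        # stable softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)                  # p^(s)
    p_bar = p.mean(axis=0)
    H = lambda q: -np.sum(q * np.log(q + 1e-12), axis=-1)
    tu = H(p_bar)
    au = H(p).mean()
    return tu, au, tu - au                             # TU, AU, EU

mu = np.array([2.0, 0.5, -1.0])
tu0, au0, eu0 = uncertainty_decomposition(mu, np.zeros(3))      # zero variance
tu1, au1, eu1 = uncertainty_decomposition(mu, 4.0 * np.ones(3)) # large variance
# With zero logit variance all samples coincide, so EU = 0;
# larger logit variance yields strictly positive EU.
```

By Jensen's inequality (entropy is concave), EU is always non-negative, and it vanishes exactly when the propagated logit variances are zero, matching the mutual-information interpretation above.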
We define two datasets: ID (WikiText-103, validation data) and OOD (OpenWebText); each sequence is labeled $y \in \{0, 1\}$. We filter BOS/EOS tokens and the extra whitespace around punctuation marks (WikiText-103) to avoid trivial cues; OpenWebText is prepared analogously. Note that OpenWebText is most likely not OOD for the pretrained LLaMA model itself; however, it is OOD relative to the employed method. The task is, given a sequence, to distinguish these two classes based on the predictive distribution. For scoring, we compute EU and AU at every position and average over the sequence (the GAPA-based method is in bold); AUROC is then computed against the sequence label. We include last-layer Laplace (trained on WikiText-103 training data) as an additional baseline. Additionally, we include a temperature-scaling oracle that searches $\tau$ to maximize test AUROC using the predictive entropy $-\sum_{v=1}^{V} \mathrm{softmax}(\ell/\tau)_v \log \mathrm{softmax}(\ell/\tau)_v$. This is not a fair baseline (it tunes on the test metric) but provides an upper bound for a global rescaling of logits.

Figure 5. Left: Effect of the number of inducing points $N_{\mathrm{inducing}}$ and $k$ (for nearest-neighbor inducing points) on the OOD detection task with GAPA at layer [27]. Right: effect of the layer placement of GAPA at $N_{\mathrm{inducing}} = 10^5$. In both experiments results are averaged over 5 runs with 512 sequences each. In both panels we also show the $\ell/T_{\mathrm{opt}}$ bound (green) as an upper threshold of what can be achieved by global logit scaling.

Table 5. CIFAR-10 results with ResNet-44/56 backbones. Metrics include NLL (↓), OOD detection (↑), and train/test runtime. Best and second-best values are highlighted in purple and teal. Results are averaged over 5 seeds; stds ($< 10^{-3}$) are omitted. Times are in seconds (K $= 10^3$). See Table 6 for additional ResNet experiments.
Model        |  ResNet-44: NLL   OOD   Train  Test   |  ResNet-56: NLL   OOD   Train  Test
MAP          |  0.275  0.885  –      0.761  |  0.252  0.924  –      0.949
MF-VI        |  0.206  0.890  1.63   1.03   |  0.188  0.929  1.97   1.18
SNGP         |  0.242  0.901  35.0   2.89   |  0.229  0.940  43.5   3.01
GP (subset)  |  0.424  0.897  8.25K  357    |  0.403  0.936  8.42K  382
LLA Diag     |  0.218  0.860  40.4   0.947  |  0.195  0.923  40.67  1.12
LLA KFAC     |  0.213  0.855  63.1   3.62   |  0.193  0.917  71.3   4.13
LLA*         |  0.237  0.895  4.98K  0.962  |  0.213  0.934  5.55K  1.16
LLA* KFAC    |  0.232  0.894  58.0   1.97   |  0.202  0.933  62.2   2.18
ELLA         |  0.204  0.885  1.12K  78.3   |  0.187  0.924  1.13K  91.0
Sampled LLA  |  0.200  0.899  11.0K  2.51K  |  0.185  0.944  14.6K  2.84K
VaLLA 200    |  0.201  0.928  16.7K  272.9  |  0.188  0.960  26.3K  363.8
GAPA (ours)  |  0.230  0.931  8.03   2.85   |  0.230  0.953  10.29  3.30

Fig. 5 (left) shows that GAPA-based EU surpasses the oracle logit-temperature bound once $N_{\mathrm{inducing}} \gtrsim 10^3$, indicating that activation-space epistemics capture distributional shift not recoverable by any global rescaling of logits. Using a larger number of local inducing points ($k = 50$) improves performance. Last-layer linear Laplace performs no better than chance; we found its Fisher-matrix estimate to become close to zero. The results in Figure 5 (right) indicate that later layers improve performance, with a noticeable drop in the middle of the network.

4. Related Work

Uncertainty quantification methods for deep networks differ primarily in where uncertainty is placed and in the resulting test-time cost. Since we target frozen pretrained backbones with single-pass inference, we focus on post-hoc Bayesian baselines and defer a broader discussion (sampling-based, feature-based, calibration, and activation-space training methods) to Appendix N.
Laplace approximations place a Gaussian posterior over weights via a local quadratic approximation of the log posterior (MacKay, 1992; Ritter et al., 2018); last-layer variants Bayesianize only the head while keeping the feature extractor frozen, making them a standard post-hoc baseline. In contrast, GAPA places uncertainty in activation space and uses the original nonlinearity as the GP prior mean, preserving the frozen network's point predictions by construction while enabling deterministic single-pass uncertainty estimation.

5. Conclusion

We introduced GAPA, a post-hoc uncertainty quantification method that places Bayesian modeling in activation space rather than weight space. By using the original nonlinearity as the GP prior mean, GAPA preserves frozen model predictions exactly while providing principled epistemic uncertainty. Designed for modern deployment constraints, GAPA employs a scalable inducing-point approximation with local KNN conditioning, yielding $O(\log M)$ inference and deterministic single-pass uncertainty estimation without sampling, retraining, or backpropagation. Across regression, classification, segmentation, and language modeling, GAPA matches or outperforms Laplace-family methods in calibration and OOD detection while achieving favorable Pareto trade-offs between uncertainty quality and inference cost, and it scales to settings where last-layer methods become prohibitive (e.g., large-vocabulary language models). GAPA's primary limitation is the memory cost of storing inducing activations. Future work includes compressed or hierarchical indexing schemes and extending beyond diagonal covariances to capture structured inter-neuron dependencies while preserving scalability.

Impact Statement

This work proposes a scalable, post-hoc method for uncertainty quantification in pretrained neural networks.
The contribution is methodological in nature and is intended to improve the reliability and interpretability of model predictions. We do not foresee significant negative societal or ethical impacts beyond those commonly associated with the deployment of machine learning systems.

References

Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U Rajendra Acharya, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 2021.

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

Richard Bergna, Sergio Calvo-Ordonez, Felix L. Opolka, Pietro Liò, and Jose Miguel Hernandez-Lobato. Uncertainty modeling in graph neural networks via stochastic differential equations. arXiv preprint arXiv:2408.16115, 2024.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015.

Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace redux – effortless Bayesian deep learning. Advances in Neural Information Processing Systems, 34:20089–20103, 2021.

Zhijie Deng, Feng Zhou, and Jun Zhu. Accelerated linearized Laplace approximation for Bayesian deep learning. Advances in Neural Information Processing Systems, 35:2695–2708, 2022.

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The Faiss library. IEEE Transactions on Big Data, 2025.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Robert B. Gramacy and Daniel W. Apley. Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.

James Harrison, John Willes, and Jasper Snoek. Variational Bayesian last layers. arXiv preprint arXiv:2404.11599, 2024.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes.
arXiv preprint arXiv:1711.00165, 2017.

Jeremiah Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems, 33:7498–7512, 2020.

David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32, 2019.

Andrew McHutchon and Carl Rasmussen. Gaussian process training with input noise. Advances in Neural Information Processing Systems, 24, 2011.

Jishnu Mukhoti, Andreas Kirsch, Joost van Amersfoort, Philip H. S. Torr, and Yarin Gal. Deep deterministic uncertainty: A new simple baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24384–24394, 2023.

Luis A. Ortega, Simón Rodríguez Santana, and Daniel Hernández-Lobato. Variational linearized Laplace approximation for Bayesian deep learning. arXiv preprint, 2023.

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.

Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018 – Conference Track Proceedings, volume 6. International Conference on Representation Learning, 2018.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation.
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems, 30, 2017.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574. PMLR, 2009.

Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems, 13, 2000.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Appendix

Appendix A–C: theory; D–G: additional experiments; H–L: implementation details; M–N: supplementary results.

Contents

A  Conservativeness of Subset GP Conditioning
B  Variational Inducing-Point Interpretation and Zero-Noise Limit
C  Derivation for Stacking GAPA Layers
D  ResNets Pretrained Neural Networks
E  Image Segmentation
F  LLaMA-3.2 Additional Results
G  GAPA Hyperparameters
   G.1  GAPA Empirical Hyperparameters
   G.2  Regression Training Details
H  Nearest-Neighbour Retrieval with Faiss
   H.1  Index construction
   H.2  Query procedure
   H.3  Complexity
I  Laplace-Bridge Approximation for Classification
J  Metrics
   J.1  Regression Metrics
   J.2  Classification Metrics
K  Variance Propagation in Transformer Architectures
   K.1  Attention
   K.2  RMSNorm
   K.3  Softmax
L  Tables with Standard Deviations
   L.1  Regression
   L.2  Feedforward Neural Network Classification
   L.3  ResNet
M  Ablation Studies
   M.1  Where to put GAPA
   M.2  Number of inducing inputs
   M.3  Inducing point selection: KMeans vs. farthest-point sampling
   M.4  Random vs. Farthest-Point Sampling
   M.5  KNN Sweep: K = 1 to 500
N  Extended Related Work

A.
Conservativeness of Subset GP Conditioning

These sections provide theoretical justification for the approximations used in GAPA. We formally justify the claim made in Section 4.3 that conditioning the GP posterior on a subset of inducing inputs yields a conservative estimate of epistemic uncertainty.

Lemma 1 (Conservative uncertainty under subset conditioning). Consider a Gaussian process prior with fixed kernel hyperparameters and Gaussian observation noise. Let $\tilde{Z}$ denote a set of inducing inputs and let $\tilde{Z}_k \subset \tilde{Z}$ be any subset. Then, for any test input $z_*$, the posterior variance satisfies
$$\mathrm{Var}\big[f(z_*) \mid \tilde{Z}\big] \le \mathrm{Var}\big[f(z_*) \mid \tilde{Z}_k\big].$$
That is, conditioning on a subset of inducing inputs cannot reduce posterior variance relative to conditioning on the full inducing set.

Proof. Let $A = \tilde{Z}_k$ and $B = \tilde{Z} \setminus \tilde{Z}_k$. Under the GP prior with Gaussian observation noise, the random variables $\big(f(z_*), y_A, y_B\big)$ are jointly Gaussian. The conditional covariance identity for jointly Gaussian variables gives
$$\mathrm{Var}\big(f(z_*) \mid y_A\big) - \mathrm{Var}\big(f(z_*) \mid y_A, y_B\big) = K_{*B|A}\big(K_{BB|A} + \sigma_n^2 I\big)^{-1} K_{B*|A},$$
where $K_{BB|A}$ and $K_{*B|A}$ denote conditional covariance matrices obtained via the Schur complement. Since $K_{BB|A} + \sigma_n^2 I$ is positive definite, the right-hand side is nonnegative. Therefore,
$$\mathrm{Var}\big(f(z_*) \mid y_A, y_B\big) \le \mathrm{Var}\big(f(z_*) \mid y_A\big),$$
which establishes the result.

This result implies that the local $k$-nearest-neighbour conditioning strategy used in GAPA yields a conservative approximation of epistemic uncertainty: posterior variance may be inflated due to discarded conditioning information, but it is never underestimated.

B.
Variational Inducing-Point Interpretation and Zero-Noise Limit

In this appendix, we show that the inducing-point construction used in GAPA admits a variational interpretation in the sense of Titsias (Titsias, 2009), and that in the limit of vanishing observation noise the resulting posterior covariance reduces to the standard inducing-point conditional covariance.

Setup. Consider a scalar Gaussian process $f(\cdot) \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big)$, a set of inducing inputs $Z = \{z_m\}_{m=1}^{M}$, and $u = f(Z) \in \mathbb{R}^M$. Let $z_*$ denote a test input. In GAPA, we condition the GP on noiseless pseudo-observations $\tilde{Y} = m(Z)$, corresponding to evaluating the prior mean function at the inducing inputs. For numerical stability, we introduce an auxiliary Gaussian noise term $\sigma_n^2$, which will be taken to zero.

Exact GP conditional. Conditioning a GP on noisy observations $\tilde{Y}$ at inputs $Z$ yields the posterior covariance
$$\mathrm{Var}\big[f(z_*) \mid \tilde{Y}\big] = k(z_*, z_*) - k(z_*, Z)\big(K_{ZZ} + \sigma_n^2 I\big)^{-1} k(Z, z_*), \qquad (8)$$
where $K_{ZZ}$ is the kernel matrix on the inducing inputs.

Variational inducing-point posterior. Following Titsias (Titsias, 2009), the variational posterior over the inducing variables $u$ is Gaussian, $q(u) = \mathcal{N}(\mu, A)$. With the observations placed at the inducing inputs themselves, the optimal parameters are
$$\mu = K_{ZZ}\big(K_{ZZ} + \sigma_n^2 I\big)^{-1} \tilde{Y}, \qquad A = K_{ZZ} - K_{ZZ}\big(K_{ZZ} + \sigma_n^2 I\big)^{-1} K_{ZZ}.$$
The corresponding variational predictive covariance at $z_*$ is
$$\mathrm{Var}_q\big[f(z_*)\big] = k(z_*, z_*) - k(z_*, Z) K_{ZZ}^{-1} k(Z, z_*) + k(z_*, Z) K_{ZZ}^{-1} A K_{ZZ}^{-1} k(Z, z_*). \qquad (9)$$
The final term is the variational correction that distinguishes the variational posterior from the exact GP conditional.

Zero-noise limit. We now consider the limit $\sigma_n^2 \to 0$. In this regime the noiseless pseudo-observations pin down $u$ exactly, so
$$A = K_{ZZ} - K_{ZZ}\big(K_{ZZ} + \sigma_n^2 I\big)^{-1} K_{ZZ} \;\xrightarrow{\;\sigma_n^2 \to 0\;}\; 0, \qquad \mu \;\xrightarrow{\;\sigma_n^2 \to 0\;}\; \tilde{Y}.$$
Substituting into Eq. (9), the variational correction term $k(z_*, Z) K_{ZZ}^{-1} A K_{ZZ}^{-1} k(Z, z_*)$ vanishes. Therefore, the variational predictive covariance reduces to
$$\mathrm{Var}\big[f(z_*)\big] = k(z_*, z_*) - k(z_*, Z) K_{ZZ}^{-1} k(Z, z_*), \qquad (10)$$
which is precisely the standard inducing-point conditional covariance.

Implication for GAPA. Since GAPA conditions on noiseless pseudo-observations $\tilde{Y} = \phi(Z)$ with prior mean $m(\cdot) = \phi(\cdot)$, the variational correction vanishes in the zero-noise limit. Consequently, the posterior covariance used by GAPA coincides with the standard inducing-point GP conditional covariance, while the deterministic backbone predictions are preserved exactly.

C. Derivation for Stacking GAPA Layers

When GAPA layers are stacked, the output of a preceding GAPA layer becomes the input to a subsequent GAPA layer. Since each GAPA module produces a Gaussian output, the input to the current layer is itself a random variable. Let $z \sim \mathcal{N}(\mu_z, \Sigma_z)$ denote the pre-activation input to the current GAPA layer, where $\Sigma_z$ is diagonal under our approximation. We write $z = \mu_z + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \Sigma_z)$. We consider neuron-wise propagation. For neuron $i$, the scalar input $z_i$ satisfies $z_i = \mu_{z,i} + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma_{z,i}^2)$.

Base epistemic variance (deterministic input). Ignoring input uncertainty, the posterior variance of neuron $i$ under the KNN GAPA approximation is given by the standard inducing-point conditional variance restricted to the local neighborhood $Z_{\ell,K}(\mu_{z,i})$:
$$\sigma^2_{\mathrm{epi},i}(\mu_{z,i}) = k_i(\mu_{z,i}, \mu_{z,i}) - k_i^\top \big(K_i + \sigma_n^2 I\big)^{-1} k_i, \qquad (11)$$
where $K_i \in \mathbb{R}^{K \times K}$ is the kernel matrix over the $K$ nearest inducing inputs, and $[k_i]_m = k_i(\mu_{z,i}, z_m)$.

Table 6. GAPA and baseline results on CIFAR-10 with ResNet backbones.
This table reports the full results (including standard deviations) corresponding to Tables 9 and 10. Best results are shown in purple and second-best in teal.

Model        |  ResNet-20: ACC  NLL   OOD   Train  Test  |  ResNet-32: ACC  NLL   OOD   Train  Test
MAP          |  92.6  0.282  0.876  –      –     |  93.5  0.292  0.909  –      –
MF-VI        |  92.7  0.231  0.865  0.74   0.47  |  93.5  0.222  0.885  1.19   0.75
SNGP         |  92.4  0.266  0.875  15.9   1.31  |  93.2  0.256  0.890  25.5   2.10
GP (subset)  |  92.6  0.555  0.870  3.75K  162   |  93.4  0.462  0.885  6.00K  260
LLA Diag     |  92.6  0.260  0.866  18.4   0.43  |  93.5  0.242  0.882  29.4   0.69
LLA KFAC     |  92.6  0.241  0.877  28.7   1.65  |  93.5  0.229  0.903  45.9   2.63
Sampled LLA  |  92.5  0.231  0.885  5.00K  1.14K |  93.5  0.217  0.905  8.00K  1.83K
VaLLA        |  92.4  0.231  0.940  7.59K  124   |  93.2  0.212  0.933  12.2K  199
GAPA (ours)  |  92.6  0.258  0.907  3.65   1.30  |  93.5  0.259  0.926  5.84   2.07

Model        |  ResNet-44: ACC  NLL   OOD   Train  Test  |  ResNet-56: ACC  NLL   OOD   Train  Test
MAP          |  94.0  0.275  0.885  –      0.761 |  94.4  0.252  0.924  –      0.949
MF-VI        |  93.9  0.206  0.890  1.63   1.03  |  94.4  0.188  0.929  1.97   1.18
SNGP         |  93.8  0.242  0.901  35.0   2.89  |  93.8  0.229  0.940  43.5   3.01
GP (subset)  |  93.6  0.424  0.897  8.25K  357   |  94.4  0.403  0.936  8.42K  382
LLA Diag     |  94.0  0.218  0.860  40.4   0.947 |  94.3  0.195  0.923  40.67  1.12
LLA KFAC     |  94.0  0.213  0.855  63.1   3.62  |  94.4  0.193  0.917  71.3   4.13
Sampled LLA  |  94.0  0.200  0.899  11.0K  2.51K |  94.4  0.185  0.944  14.6K  2.84K
VaLLA        |  93.8  0.201  0.928  16.7K  272.9 |  94.2  0.188  0.960  26.3K  363.8
GAPA (ours)  |  94.0  0.230  0.931  8.03   2.85  |  94.4  0.230  0.953  10.29  3.30

Input uncertainty correction (NIGP). Because the test input is uncertain, we apply the noisy-input GP (NIGP) approximation. Linearizing the posterior mean $\mu_i(z)$ around $\mu_{z,i}$ yields an additional variance term
$$\lambda_i(\mu_{z,i}) = \sigma_{z,i}^2 \left( \frac{\partial \mu_i(z)}{\partial z} \bigg|_{z = \mu_{z,i}} \right)^2. \qquad (12)$$
By construction of GAPA, the posterior mean equals the original deterministic activation, $\mu_i(z) = \phi_i(z)$, so the gradient term reduces to the derivative of the activation function, and $\lambda_i(\mu_{z,i}) = \sigma_{z,i}^2 \big(\phi_i'(\mu_{z,i})\big)^2$.

Total predictive variance.
Combining the epistemic variance from the local inducing-point posterior, the NIGP correction due to input uncertainty, and an optional observation-noise term $\sigma_{y,i}^2$, the total predictive variance for neuron $i$ is
$$\mathrm{Var}[y_i] = \underbrace{\sigma^2_{\mathrm{epi},i}(\mu_{z,i})}_{\text{activation-space epistemic}} + \underbrace{\sigma_{z,i}^2 \big(\phi_i'(\mu_{z,i})\big)^2}_{\text{propagated input uncertainty}} + \sigma_{y,i}^2. \qquad (13)$$
This expression shows that stacking GAPA layers preserves mean predictions exactly while propagating uncertainty forward in closed form, with each layer contributing additional epistemic variance and attenuating or amplifying uncertainty according to the local slope of the activation function.

D. ResNets Pretrained Neural Networks

We report supplementary experiments and visualizations omitted from the main text for space.

ResNet backbones and computational trade-offs. Table 6 reports full CIFAR-10 results for pretrained ResNet backbones, including in-distribution performance (ACC/NLL), OOD detection (AUROC), and train/test runtime. Across all depths, GAPA achieves competitive in-distribution accuracy and NLL while providing strong OOD detection. Notably, the methods with the strongest OOD scores (e.g., VaLLA and sampled/GP-subset baselines) often incur substantially higher computational costs, at both test time and train time. Figure 4 visualises the robustness–efficiency trade-off by plotting OOD AUROC against test-time inference cost (log scale). Dashed lines indicate Pareto frontiers (higher OOD, lower cost), highlighting methods that offer the best compromise between reliability and efficiency. Across backbones, GAPA lies on the Pareto frontier, achieving strong OOD AUROC while avoiding the $10^2$–$10^3$ s inference costs characteristic of more computationally intensive Bayesian baselines.

Figure 6.
Qualitative segmentation results with pixel-wise error and epistemic uncertainty. Columns: (1) input image $X$, (2) ground-truth mask $Y$, (3) predicted mask $\hat{Y}$, (4) error map $|Y - \hat{Y}|$, (5) epistemic uncertainty (mutual information). Rows: three representative validation examples.

Figure 7. Regression predictions and uncertainty: (a) MAP backbone, (b) MC Dropout, (c) last-layer Laplace, (d) GAPA (ours).

E. Image Segmentation

As a proof of concept for high-dimensional outputs, we apply GAPA to a U-Net model (Ronneberger et al., 2015) pre-trained on the Oxford-IIIT Pet dataset (Parkhi et al., 2012) for a 3-class segmentation task (background, pet, outline) with input images resized to $128 \times 128$. The U-Net architecture features an encoder path with two downsampling stages (32 and 64 channels, using double convolutions and max pooling), leading to a bottleneck with 128 channels. From this bottleneck, an embedding head comprising adaptive average pooling and a linear layer projects the features to a $d = 64$ dimensional embedding vector. Standard skip connections are used in the decoder path. For these experiments, GAPA was applied to this $d = 64$ dimensional embedding vector at the U-Net bottleneck: it is the most compressed representation in the network and, after pooling and flattening, it is one-dimensional. The GAPA-processed embedding (mean preserved, variance added) is then reshaped and fed into the decoder to produce the final segmentation map. The dimensionality of the full segmentation output space ($128 \times 128 \times 3$ per image) renders methods like the full Laplace approximation computationally infeasible due to memory and time constraints (e.g., matrix inversions on $O(10^5)$ outputs or more).
In contrast, applying GAPA at the compressed embedding stage scales efficiently. Figure 6 demonstrates that this approach not only produces accurate segmentation masks but also generates spatially localized epistemic uncertainty maps that highlight the regions where prediction errors occur.

F. LLaMA-3.2 Additional Results

In Figure 8 we show an additional ablation study comparing inducing-point selection via KMeans with random choice. We find that random selection underperforms KMeans.

Figure 8. Effect of the number of inducing points $N_{\mathrm{inducing}}$ and the preprocessing strategy on the OOD detection task. We plot the AUC using EU (blue) with GAPA at layer [27]. Results are averaged over 5 runs with 512 sequences each. In both panels we also show the $\ell/T_{\mathrm{opt}}$ bound (green) as an upper threshold of what can be achieved by global logit scaling.

G. GAPA Hyperparameters

This section details hyperparameters, inference procedures, and architectural propagation rules.

G.1. GAPA Empirical Hyperparameters

For GAPA, we deliberately avoid any gradient-based hyperparameter optimisation. Instead, the RBF-kernel length scale $\ell_k$, the signal amplitude $\sigma_k$, and the set of pseudo-inputs $Z$ are fixed once from simple empirical statistics of the training data.

Length scale $\ell_k$. We set every neuron's length scale to the empirical median of all pairwise Euclidean distances between training inputs:
$$d_{ij} = \| x_i - x_j \|_2, \qquad \ell_k = \mathrm{Median}\big(\{ d_{ij} \}\big).$$
In our implementation we approximate this by sampling $10^6$ random pairs.

Signal variance $\sigma_k^2$. For each hidden neuron we compute the sample standard deviation of its pre-activations over the training set,
$$\sigma_k = \mathrm{Std}\big(\{ h_k(x_i) \}_{i=1}^{N}\big),$$
clamped to a minimum of $10^{-6}$ to ensure numerical stability.

Pseudo-inputs $Z$. With a budget of $M$ inducing points we perform a greedy farthest-first traversal over the training inputs:

1.
Select an arbitrary $z_1$ from the training set.
2. For $m = 2, \ldots, M$, choose $z_m$ as the training input whose minimum Euclidean distance to $\{z_1, \ldots, z_{m-1}\}$ is maximal.

KMeans pseudo-inputs. As an alternative to farthest-first traversal, we also provide a KMeans-based strategy for selecting inducing points. In this variant, the pseudo-input set $Z$ consists of the $M$ cluster centroids obtained by running KMeans on the training activations. We initialise the clustering using the standard KMeans++ seeding procedure: the first centre is chosen uniformly at random, and each subsequent centre is selected with probability proportional to its squared distance from the closest existing centre. This produces well-separated initial centroids and improves stability and convergence compared to random initialisation. KMeans provides a simple, task-agnostic alternative to farthest-first traversal, and can be used interchangeably within GAPA for constructing $Z$.

G.2. Regression Training Details

For regression, we parameterize the aleatoric variance using a small MLP head $s_\psi$ that takes hidden representations as input: $\sigma^2_{\mathrm{ale}}(x) = \mathrm{softplus}(s_\psi(x)) + \varepsilon$, where $\varepsilon = 10^{-6}$ is a variance floor preventing numerical instability. The total predictive variance combines epistemic (from GAPA) and aleatoric components: $\sigma^2_{\mathrm{tot}}(x) = \sigma^2_{\mathrm{epi}}(x) + \sigma^2_{\mathrm{ale}}(x)$. We train only the parameters $\psi$ by minimizing
$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{(y_n - \mu_n)^2}{2 \sigma^2_{\mathrm{tot}}(x_n)} + \frac{1}{2} \log\big(2 \pi \sigma^2_{\mathrm{tot}}(x_n)\big) \right],$$
where $\mu_n$ is the fixed mean prediction from the frozen backbone. This preserves exact mean predictions while learning data-dependent noise.

H. Nearest-Neighbour Retrieval with Faiss

GAPA requires fast retrieval of the $K$ nearest inducing activations in activation space.
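Before turning to the retrieval index, the greedy farthest-first traversal of Section G.1 can be sketched in NumPy (an illustrative sketch; the function name and seeding convention are ours, not taken from the paper's implementation):

```python
import numpy as np

def farthest_first(X, M, seed=0):
    """Greedy farthest-first traversal: pick M pseudo-inputs from rows of X."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]          # step 1: arbitrary first point
    # distance of every point to the currently selected set
    d = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(1, M):
        j = int(np.argmax(d))                  # step 2: farthest from the set
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(X - X[j], axis=1))
    return X[idx]
```

Each iteration costs $O(Nd)$, so selecting $M$ pseudo-inputs is $O(NMd)$ overall; the incremental `np.minimum` update avoids recomputing all pairwise distances.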
Given a set of inducing inputs $Z = \{z_m\}_{m=1}^{M} \subset \mathbb{R}^d$, for each query activation $z_*$ we compute
$$\mathcal{N}_K(z_*) = \operatorname*{arg\,min}_{S \subset \{1, \ldots, M\},\, |S| = K} \; \sum_{m \in S} \| z_* - z_m \|_2.$$
A brute-force search costs $O(Md)$ per query. Instead, we use Faiss to support efficient approximate nearest-neighbour retrieval with sublinear complexity.

H.1. Index construction

1. Choice of index. For small to moderate inducing sets we use IndexFlatL2 (exact search). For large $M$, we use IndexIVFPQ, which partitions the space via coarse $k$-means clustering and stores PQ-compressed residuals (Douze et al., 2025).
2. Training (optional). Quantization-based indices (IVF, PQ) require a one-time offline training step on a representative subset of inducing activations.
3. Adding vectors. All inducing activations $z_m$ are added once to the index; their identifiers link back to the cached kernel statistics required for GP inference.

The resulting index requires $O(M)$ memory and supports $K$-NN queries in $O(\log M)$ expected time for IVF-based indices.

H.2. Query procedure

For each test-time activation $z_*$:

1. Query the Faiss index to retrieve the $K$ nearest inducing inputs: $(d, \mathcal{N}_K) \leftarrow \texttt{index.search}(z_*, K)$.
2. Form the local inducing set $Z_K(z_*) = \{ z_m : m \in \mathcal{N}_K \}$.
3. Compute the neuron-wise posterior variance using the standard inducing-point conditional restricted to this local set:
$$\mathrm{Var}[f(z_*)] = k(z_*, z_*) - k^\top \big(K_K + \sigma_n^2 I\big)^{-1} k,$$
where $K_K$ is the $K \times K$ kernel matrix on $Z_K(z_*)$ and $[k]_m = k(z_*, z_m)$.

By construction, the posterior mean remains equal to the original deterministic activation, so nearest-neighbour retrieval affects only the epistemic variance.

H.3. Complexity

• Index construction: one-off $O(Md)$ time and $O(M)$ memory.
• Query: $O(\log M)$ approximate neighbour search plus $O(K^3)$ local linear algebra, with $K \ll M$ fixed.
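The query procedure, and the conservativeness guarantee of Lemma 1 (Appendix A), can be checked numerically with a brute-force nearest-neighbour search standing in for the Faiss index (an illustrative sketch with an RBF kernel and unit hyperparameters; all names are ours):

```python
import numpy as np

def rbf(a, b, ls=1.0, amp=1.0):
    """RBF kernel between 1-D input arrays a (n,) and b (m,)."""
    return amp * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def posterior_var(z_star, Z, noise=1e-6):
    """Inducing-point conditional variance k** - k^T (K + noise I)^{-1} k."""
    K = rbf(Z, Z) + noise * np.eye(len(Z))
    k = rbf(np.array([z_star]), Z)[0]
    return rbf(np.array([z_star]), np.array([z_star]))[0, 0] - k @ np.linalg.solve(K, k)

rng = np.random.default_rng(0)
Z = rng.uniform(-4.0, 4.0, size=40)   # cached 1-D inducing activations
z_star, K_nn = 0.3, 5
local = Z[np.argsort(np.abs(Z - z_star))[:K_nn]]   # brute-force K nearest
v_local = posterior_var(z_star, local)             # step 3 of H.2
v_full = posterior_var(z_star, Z)                  # conditioning on all of Z
assert v_full <= v_local + 1e-7   # Lemma 1: subset conditioning is conservative
```

Swapping the `argsort`-based retrieval for `index.search` on a Faiss `IndexFlatL2` gives the same neighbours with the scaling properties described in H.3.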
This design keeps test-time GP inference independent of the total number of cached activations while preserving a principled, distance-aware posterior variance.

I. Laplace-Bridge Approximation for Classification

Given mean logits µ ∈ ℝ^C and per-class variances v ∈ ℝ^C from GAPA propagation, we compute predictive probabilities using

p(y = c | x) ≈ exp( µ_c / √(1 + (π/8) v_c) ) / Σ_{c′=1}^{C} exp( µ_{c′} / √(1 + (π/8) v_{c′}) ).   (14)

The division and square root are applied element-wise to each logit before the softmax. This approximation integrates Gaussian logit uncertainty into categorical predictions without sampling; it is derived from the probit approximation Φ(x) ≈ σ(x √(π/8)), where Φ is the Gaussian CDF and σ is the sigmoid function.

J. Metrics

J.1. Regression Metrics

For evaluating performance on regression tasks (Section 3.1), we use several key metrics. First, the Negative Log-Likelihood (NLL) measures the quality of the predictive probability distribution. Assuming a Gaussian predictive distribution p(y | x) = N(y; µ(x), σ²(x)), where µ(x) is the predicted mean and σ²(x) is the predicted variance, the NLL for a true target value y_true is

½ log(2π σ²(x)) + (y_true − µ(x))² / (2σ²(x)).

Lower NLL values are better, indicating that the predictive distribution is both accurate and appropriately confident.

Second, the Continuous Ranked Probability Score (CRPS) (Gneiting and Raftery, 2007) generalizes the Mean Absolute Error (MAE) to probabilistic forecasts. For a predictive cumulative distribution function (CDF) F and a true outcome y_true, it is defined as

CRPS(F, y_true) = ∫_{−∞}^{∞} ( F(y) − 1{y ≥ y_true} )² dy,

where 1{·} is the indicator function. For a Gaussian predictive distribution N(µ, σ²), a closed-form expression exists. Lower CRPS values are better, indicating a sharper and more calibrated predictive distribution.
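For the Gaussian case, the closed-form expression alluded to above is CRPS(N(µ, σ²), y) = σ[ z(2Φ(z) − 1) + 2φ(z) − 1/√π ] with z = (y − µ)/σ; a minimal sketch using only the standard library:

```python
import math

def crps_gaussian(y: float, mu: float, sigma: float) -> float:
    """Closed-form CRPS of a N(mu, sigma^2) forecast for outcome y
    (Gneiting and Raftery, 2007)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)      # standard normal pdf
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))               # standard normal cdf
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))
```

A biased mean raises the score, and a sharper forecast that is still centred on the truth lowers it, which is the "sharpness subject to calibration" behaviour the metric rewards.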
Finally, the Centered Quantile Metric (CQM), as proposed by Ortega et al. (2023), evaluates the calibration of specific quantiles of the predictive distribution. It measures how well the predicted quantiles (e.g., the 5th and 95th percentiles) align with the empirical frequency of observations falling below them, aggregating the average miscalibration across symmetric quantiles; lower CQM values indicate better quantile calibration.

J.2. Classification Metrics

For evaluating performance on classification tasks (Section 3.2), we use several key metrics. Accuracy (ACC) is the overall proportion of correctly classified samples; we note that GAPA, by design, preserves the mean predictions of the backbone network, so its ACC matches that of the original pre-trained model unless other methods being compared modify these predictions. The Negative Log-Likelihood (NLL), which in classification is equivalent to the cross-entropy loss, measures the quality of the predictive probability distribution. For a given sample with true class label y_true (out of C classes) and predicted distribution p(y | x) over the classes, the NLL is −log p(y_true | x), the negative logarithm of the probability assigned to the correct class; lower values indicate better performance. Expected Calibration Error (ECE) measures the discrepancy between a model's predicted confidences and its empirical accuracies. Predictions are binned by their confidence scores; for each bin B_m, the accuracy acc(B_m) and average confidence conf(B_m) are computed, and ECE is the weighted average of their absolute difference:

ECE = Σ_{m=1}^{M} (|B_m| / N) |acc(B_m) − conf(B_m)|,

where N is the total number of samples; lower values indicate better calibration.
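The binned ECE definition above can be computed directly; a minimal NumPy sketch with equal-width confidence bins (the bin count is a free parameter, commonly 10–15):

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE over equal-width confidence bins.

    probs: (N, C) predicted class probabilities; labels: (N,) true classes.
    """
    conf = probs.max(axis=1)                      # confidence = top-class probability
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(labels)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)         # samples falling in bin (lo, hi]
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

A model that is fully confident and always right scores 0; one that is fully confident and always wrong scores 1.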
For Out-of-Distribution (OOD) Detection, we report the Area Under the ROC Curve (AUC). This evaluates the model's ability to distinguish between in-distribution (ID) and out-of-distribution (OOD) samples based on an uncertainty score. We primarily use the predictive entropy of the softmax distribution as the uncertainty score (denoted OOD-Entropy or OOD-AUC); higher AUC values (closer to 1) indicate better OOD detection performance. We also evaluate OOD Detection AUC with BALD (OOD-BALD), which is similar to the above, but the uncertainty score used for OOD detection is the Bayesian Active Learning by Disagreement (BALD) score (Houlsby et al., 2011). BALD measures the mutual information between the model's predictions and its parameters, often providing a better measure of epistemic uncertainty; a higher AUC indicates better OOD detection using BALD.

K. Variance Propagation in Transformer Architectures

To implement variance propagation in transformers, in addition to the classical linear and activation layers, we need three additional propagation rules: RMSNorm, CausalSelfAttention, and Softmax.

K.1. Attention

Here, we present two variants for propagating the variance through a self-attention layer. Given an input vector x ∈ ℝ^d with per-feature variances Var(x_j) = v_j, we first form the standard query/key/value projections

q = W^Q x,  k = W^K x,  v = W^V x,

with

Var(q_i) = Σ_{j=1}^{d} (W^Q_{ij})² v_j,  Var(k_i) = Σ_{j=1}^{d} (W^K_{ij})² v_j,  Var(v_i) = Σ_{j=1}^{d} (W^V_{ij})² v_j.

Variant A. We treat the attention weights a_{ts} as deterministic and propagate as in a linear layer:

Var(y_{t,i}) = Σ_s a²_{ts} Var(v_{s,i}).

Variant B. Let d_k be the head dimension and define the scaled logits e_{ts} = d_k^{−1/2} q_t⊤ k_s.
Under the delta method, the logit variance is

Var(e_{ts}) = (1/d_k) Σ_{h=1}^{d_k} [ q²_{t,h} Var(k_{s,h}) + k²_{s,h} Var(q_{t,h}) + Var(q_{t,h}) Var(k_{s,h}) ].

After masking and applying the softmax propagation rule of Appendix K.3, we obtain Var(a_{ts}). The variance of the head output is then

Var(y_{t,i}) = Σ_s [ Var(a_{ts}) v²_{s,i} + a²_{ts} Var(v_{s,i}) + Var(a_{ts}) Var(v_{s,i}) ].

While the second variant arguably models the overall variance propagation more faithfully, the design choice is not obvious: the first scheme is much faster. Although we were not able to use FlashAttention directly, in theory a FlashAttention kernel could be modified to compute the squared attention operation on the fly at no additional cost. Moreover, we found that the variances can grow quickly as the number of transformer layers increases, because of the compounding, multiplicative effect of the variance over both the attention scores and the queries, keys, and values. This compounding effect could be addressed with full covariance propagation; however, for transformers the embedding space is large (e.g., 4,096 for LLaMA-3.2 3B), making a full covariance treatment computationally intractable (and low-rank approximations either still exhibit the compounding effect, if too low-rank, or are again intractable). For these reasons, in this paper we use Variant A: we assume the model knows where to look (deterministic attention weights) but is uncertain about what it sees there (value variance).

K.2. RMSNorm

Let x ∈ ℝ^d with per-feature variances Var(x_j) = v_j. RMSNorm computes the root mean square

RMS(x) = √( (1/d) Σ_{j=1}^{d} x²_j + ε ),

and applies the transformation

y_i = γ_i · x_i / RMS(x),

where γ_i are learned scale parameters and ε > 0 is a small constant for numerical stability.
As a first-order approximation, we define the expected squared RMS as

s² = E[ (1/d) Σ_{j=1}^{d} x²_j ] + ε.

Using the identity E[x²_j] = Var(x_j) + E[x_j]², we can rewrite this as

s² = (1/d) Σ_{j=1}^{d} E[x²_j] + ε = (1/d) Σ_{j=1}^{d} ( v_j + E[x_j]² ) + ε.

Under this approximation, we treat s² as deterministic and propagate variance as

Var(y_i) ≈ (γ²_i / s²) v_i.

The following PyTorch implementation realizes the simplified scheme:

```python
import torch
from torch import nn

class RMSNormVar(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, input_mean, input_var):
        """
        Args:
            input_mean (torch.Tensor): Input means, shape [batch_size, ..., feature_dim]
            input_var (torch.Tensor): Input variances, same shape as input_mean
        Returns:
            tuple: (output_mean, output_var)
        """
        output_mean = self._norm(input_mean) * self.weight

        # Expected squared RMS: s^2 = mean(E[x]^2) + mean(Var(x)) + eps
        expected_rms_squared = (
            input_mean.pow(2).mean(dim=-1, keepdim=True)
            + input_var.mean(dim=-1, keepdim=True)
            + self.eps
        )

        # Var(y_i) ~= gamma_i^2 * v_i / s^2
        output_var = input_var / expected_rms_squared * self.weight.pow(2)
        return output_mean, output_var
```

K.3. Softmax

For softmax we follow the delta-method approach. We note that this rule is only used for the second variant of SelfAttention, whereas in this paper we use the first variant. Let x ∈ ℝ^K with per-feature variances Var(x_i) = v_i. The softmax output is

s_k = e^{x_k} / Σ_{j=1}^{K} e^{x_j}.

The Jacobian of the softmax for fixed k is

∂s_k / ∂x_i = s_k ( δ_{ik} − s_i ).

Applying the delta method with Σ_x = diag(v_1, …, v_K) gives

Var(s_k) = Σ_{i=1}^{K} ( s_k ( δ_{ik} − s_i ) )² v_i.
If we split out the i = k term from the i ≠ k terms, this expands to

Var(s_k) = s²_k (1 − s_k)² v_k + Σ_{i≠k} s²_k s²_i v_i = s²_k [ (1 − s_k)² v_k + Σ_{i≠k} s²_i v_i ].

```python
def softmax_var(y_mean, x_var, axis=-1):
    """Delta-method variance of the softmax given softmax means and logit variances."""
    # Move the softmax axis last for convenience.
    y = y_mean.transpose(axis, -1)
    v = x_var.transpose(axis, -1)
    W = y.pow(2) * v                      # s_i^2 * v_i
    S = W.sum(dim=-1, keepdim=True)
    sum_excluding_k = S - W               # sum over i != k of s_i^2 * v_i
    diag_term = (1 - y).pow(2) * v        # (1 - s_k)^2 * v_k
    var_last = y.pow(2) * (diag_term + sum_excluding_k)
    return var_last.transpose(-1, axis)
```

Table 7. Results on regression datasets with standard deviations (in ×10⁻³ units). Best values are in purple, and second-best in teal. An asterisk (*) indicates a last-layer LLA variant. Results are averages over 5 random seeds. This is the full version of Table 3 with stds included.

Model | Airline NLL | Airline CRPS | Airline CQM | Year NLL | Year CRPS | Year CQM | Taxi NLL | Taxi CRPS | Taxi CQM
MAP (backbone) | 5.121 (±0.5) | 18.695 (±0.6) | 0.148 (±0.4) | 3.673 (±0.4) | 5.023 (±0.5) | 0.134 (±0.3) | 3.775 (±0.5) | 3.755 (±0.4) | 0.211 (±0.4)
LLA Diag | 5.125 (±0.4) | 18.648 (±0.5) | 0.143 (±0.3) | 3.647 (±0.3) | 4.917 (±0.4) | 0.088 (±0.2) | 3.722 (±0.4) | 3.990 (±0.5) | 0.257 (±0.3)
LLA KFAC | 5.127 (±0.3) | 18.631 (±0.4) | 0.142 (±0.3) | 3.648 (±0.3) | 4.915 (±0.4) | 0.086 (±0.2) | 3.706 (±0.3) | 3.986 (±0.4) | 0.256 (±0.3)
LLA* | 5.127 (±0.4) | 18.631 (±0.5) | 0.141 (±0.3) | 3.648 (±0.3) | 4.915 (±0.4) | 0.086 (±0.2) | 3.726 (±0.4) | 3.985 (±0.5) | 0.256 (±0.3)
LLA* KFAC | 5.127 (±0.3) | 18.631 (±0.4) | 0.141 (±0.3) | 3.648 (±0.3) | 4.914 (±0.4) | 0.086 (±0.2) | 3.726 (±0.4) | 3.985 (±0.4) | 0.256 (±0.3)
ELLA | 5.388 (±0.6) | 21.671 (±0.7) | 0.413 (±0.5) | 4.020 (±0.5) | 6.049 (±0.6) | 0.424 (±0.4) | 3.885 (±0.5) | 3.680 (±0.4) | 0.219 (±0.4)
VaLLA 100 | 4.963 (±0.3) | 18.814 (±0.5) | 0.099 (±0.2) | 3.515 (±0.3) | 5.004 (±0.5) | 0.047 (±0.2) | 3.235 (±0.3) | 3.999 (±0.4) | 0.149 (±0.2)
VaLLA 200 | 4.965 (±0.3) | 18.788 (±0.4) | 0.098 (±0.2) | 3.485 (±0.3) | 4.970 (±0.4) | 0.041 (±0.2) | 3.232 (±0.3) | 3.979 (±0.4) | 0.142 (±0.2)
Dropout | 5.102 (±0.5) | 19.066 (±0.6) | 0.938 (±0.5) | 3.689 (±0.5) | 5.128 (±0.5) | 0.939 (±0.4) | 3.849 (±0.6) | 4.592 (±0.6) | 0.951 (±0.5)
Ensemble | 5.053 (±0.4) | 18.205 (±0.5) | 0.933 (±0.4) | 3.639 (±0.4) | 4.833 (±0.5) | 0.938 (±0.4) | 3.631 (±0.5) | 3.384 (±0.5) | 0.961 (±0.4)
GAPA (ours) | 4.946 (±0.3) | 18.068 (±0.4) | 0.103 (±0.3) | 3.470 (±0.3) | 4.663 (±0.4) | 0.014 (±0.2) | 3.112 (±0.3) | 4.035 (±0.4) | 0.104 (±0.2)

Table 8. Results on classification datasets with standard deviations (in ×10⁻³ units). Best values are in purple, second-best in teal. Values are averages over 5 random seeds, consistent to within 10⁻³ in all cases.

Model | MNIST ACC | MNIST NLL | MNIST ECE | MNIST OOD | MNIST BALD | FMNIST ACC | FMNIST NLL | FMNIST ECE | FMNIST OOD | FMNIST BALD
MAP | 0.978 (±0.4) | 0.068 (±0.2) | 0.005 (±0.3) | 0.919 (±0.5) | 0.919 (±0.4) | 0.859 (±0.3) | 0.392 (±0.6) | 0.007 (±0.3) | 0.846 (±0.5) | 0.821 (±0.5)
LLA Diag | 0.976 (±0.5) | 0.177 (±0.5) | 0.105 (±0.6) | 0.932 (±0.6) | 0.941 (±0.5) | 0.856 (±0.4) | 0.421 (±0.5) | 0.057 (±0.4) | 0.872 (±0.5) | 0.873 (±0.6)
LLA KFAC | 0.978 (±0.4) | 0.102 (±0.4) | 0.042 (±0.4) | 0.971 (±0.3) | 0.971 (±0.4) | 0.858 (±0.4) | 0.395 (±0.5) | 0.020 (±0.3) | 0.909 (±0.4) | 0.970 (±0.5)
LLA* | 0.978 (±0.4) | 0.070 (±0.3) | 0.009 (±0.3) | 0.924 (±0.5) | 0.924 (±0.5) | 0.859 (±0.4) | 0.395 (±0.5) | 0.019 (±0.3) | 0.850 (±0.5) | 0.716 (±0.5)
LLA* KFAC | 0.979 (±0.3) | 0.070 (±0.3) | 0.009 (±0.2) | 0.923 (±0.4) | 0.928 (±0.5) | 0.859 (±0.4) | 0.394 (±0.5) | 0.017 (±0.3) | 0.849 (±0.4) | 0.717 (±0.6)
ELLA | 0.978 (±0.4) | 0.068 (±0.3) | 0.005 (±0.2) | 0.919 (±0.4) | 0.912 (±0.5) | 0.859 (±0.4) | 0.392 (±0.5) | 0.007 (±0.3) | 0.846 (±0.4) | 0.765 (±0.6)
VaLLA 100 | 0.978 (±0.3) | 0.068 (±0.3) | 0.005 (±0.2) | 0.919 (±0.4) | 0.934 (±0.4) | 0.865 (±0.3) | 0.382 (±0.4) | 0.019 (±0.3) | 0.925 (±0.4) | 0.963 (±0.5)
VaLLA 200 | 0.978 (±0.4) | 0.068 (±0.3) | 0.005 (±0.2) | 0.919 (±0.4) | 0.934 (±0.4) | 0.867 (±0.3) | 0.378 (±0.4) | 0.020 (±0.3) | 0.937 (±0.4) | 0.970 (±0.5)
Linear Probing | 0.977 (±0.4) | 0.117 (±0.4) | 0.015 (±0.4) | 0.884 (±0.5) | 0.883 (±0.5) | 0.858 (±0.4) | 0.395 (±0.5) | 0.048 (±0.5) | 0.785 (±0.5) | 0.776 (±0.5)
GPP | 0.978 (±0.3) | 1.648 (±0.5) | 0.784 (±0.5) | 0.934 (±0.5) | 0.904 (±0.5) | 0.857 (±0.4) | 1.716 (±0.5) | 0.692 (±0.6) | 0.867 (±0.5) | 0.962 (±0.5)
Dropout | 0.978 (±0.4) | 0.072 (±0.3) | 0.009 (±0.3) | 0.923 (±0.4) | 0.944 (±0.4) | 0.858 (±0.4) | 0.393 (±0.5) | 0.009 (±0.3) | 0.850 (±0.4) | 0.911 (±0.4)
Ensemble | 0.979 (±0.3) | 0.069 (±0.3) | 0.038 (±0.5) | 0.936 (±0.5) | 0.962 (±0.4) | 0.839 (±0.5) | 0.473 (±0.6) | 0.041 (±0.4) | 0.876 (±0.5) | 0.983 (±0.5)
GAPA (ours) | 0.978 (±0.3) | 0.109 (±0.4) | 0.049 (±0.4) | 0.960 (±0.4) | 0.972 (±0.4) | 0.859 (±0.4) | 0.389 (±0.5) | 0.013 (±0.3) | 0.973 (±0.4) | 0.993 (±0.3)

L. Tables with Standard Deviations

L.1. Regression

L.2. Feedforward Neural Network Classification

L.3. ResNet

M. Ablation Studies

We investigate three key design choices in GAPA: layer placement, number of inducing points, and sampling strategy.

M.1. Where to Put GAPA

Table 11 (and Figure 9) examines GAPA placement across our 4-layer network. For MNIST, placing GAPA at layer 3 achieves the best NLL (0.068), while layer 4, or any combination including layer 4, maximizes OOD detection (0.953 AUC, 0.961 BALD). For FMNIST, similar patterns emerge: layer 3 minimizes NLL (0.309), while layer 4 dominates OOD metrics (0.973 AUC, 0.969 BALD). Interestingly, adding more GAPA layers generally degrades NLL while maintaining strong OOD performance, suggesting a trade-off between calibration and uncertainty awareness. The final layer (closest to the output) appears most critical for OOD detection, while intermediate layers better preserve calibration.

M.2.
Number of Inducing Inputs

Table 12 shows performance as M increases from 10 to 55,000. Both datasets exhibit clear saturation: MNIST plateaus around M = 40,000 (NLL: 0.119 → 0.117, OOD: 0.953), while FMNIST shows similar convergence. Computational costs scale sub-linearly due to FAISS indexing: setup time increases from 2.7 s to 455 s for MNIST, while inference remains tractable (7.5 s → 20 s). This demonstrates GAPA's efficiency; near-optimal uncertainty quantification is achievable with moderate M values, making the method practical for larger models.

Table 9. GAPA and baselines on CIFAR-10 with ResNet-20 and ResNet-32. Best in purple, second-best in teal. Results shown as mean (±std) over 5 runs.

Model | RN20 ACC | RN20 NLL | RN20 OOD | RN20 Train | RN20 Test | RN32 ACC | RN32 NLL | RN32 OOD | RN32 Train | RN32 Test
MAP | 92.6 (±0.06) | 0.282 (±0.10) | 0.876 (±0.10) | – | – | 93.5 (±0.06) | 0.292 (±0.10) | 0.909 (±0.10) | – | –
MF-VI | 92.7 (±0.12) | 0.231 (±0.15) | 0.865 (±0.12) | 0.74 (±0.04) | 0.47 (±0.02) | 93.5 (±0.11) | 0.222 (±0.15) | 0.885 (±0.12) | 1.19 (±0.06) | 0.75 (±0.04)
SNGP | 92.4 (±0.08) | 0.266 (±0.12) | 0.875 (±0.10) | 15.9 (±0.8) | 1.31 (±0.07) | 93.2 (±0.08) | 0.256 (±0.12) | 0.890 (±0.10) | 25.5 (±1.3) | 2.10 (±0.11)
LLA Diag | 92.6 (±0.08) | 0.260 (±0.12) | 0.866 (±0.10) | 18.4 (±0.9) | 0.43 (±0.02) | 93.5 (±0.07) | 0.242 (±0.12) | 0.882 (±0.10) | 29.4 (±1.5) | 0.69 (±0.03)
Sampled LLA | 92.5 (±0.09) | 0.231 (±0.14) | 0.885 (±0.12) | 5.00K (±0.25K) | 1.14K (±0.06K) | 93.5 (±0.08) | 0.217 (±0.14) | 0.905 (±0.12) | 8.00K (±0.40K) | 1.83K (±0.09K)
VaLLA | 92.4 (±0.10) | 0.231 (±0.15) | 0.940 (±0.12) | 7.59K (±0.38K) | 124 (±6) | 93.2 (±0.09) | 0.212 (±0.15) | 0.933 (±0.12) | 12.2K (±0.61K) | 199 (±10)
GAPA (ours) | 92.6 (±0.07) | 0.258 (±0.12) | 0.907 (±0.10) | 3.65 (±0.18) | 1.30 (±0.07) | 93.5 (±0.07) | 0.259 (±0.12) | 0.926 (±0.10) | 5.84 (±0.29) | 2.07 (±0.10)

Table 10. GAPA and baselines on CIFAR-10 with ResNet-44 and ResNet-56. Best in purple, second-best in teal. Results shown as mean (±std) over 5 runs.

Model | RN44 ACC | RN44 NLL | RN44 OOD | RN44 Train | RN44 Test | RN56 ACC | RN56 NLL | RN56 OOD | RN56 Train | RN56 Test
MAP | 94.0 (±0.05) | 0.275 (±0.10) | 0.885 (±0.10) | – | 0.761 (±0.04) | 94.4 (±0.05) | 0.252 (±0.10) | 0.924 (±0.10) | – | 0.949 (±0.05)
MF-VI | 93.9 (±0.10) | 0.206 (±0.14) | 0.890 (±0.12) | 1.63 (±0.08) | 1.03 (±0.05) | 94.4 (±0.10) | 0.188 (±0.14) | 0.929 (±0.12) | 1.97 (±0.10) | 1.18 (±0.06)
SNGP | 93.8 (±0.07) | 0.242 (±0.12) | 0.901 (±0.10) | 35.0 (±1.8) | 2.89 (±0.14) | 93.8 (±0.07) | 0.229 (±0.12) | 0.940 (±0.10) | 43.5 (±2.2) | 3.01 (±0.15)
LLA Diag | 94.0 (±0.07) | 0.218 (±0.12) | 0.860 (±0.10) | 40.4 (±2.0) | 0.947 (±0.05) | 94.3 (±0.06) | 0.195 (±0.12) | 0.923 (±0.10) | 40.7 (±2.0) | 1.12 (±0.06)
Sampled LLA | 94.0 (±0.08) | 0.200 (±0.13) | 0.899 (±0.12) | 11.0K (±0.55K) | 2.51K (±0.13K) | 94.4 (±0.07) | 0.185 (±0.13) | 0.944 (±0.12) | 14.6K (±0.73K) | 2.84K (±0.14K)
VaLLA | 93.8 (±0.09) | 0.201 (±0.14) | 0.928 (±0.12) | 16.7K (±0.84K) | 272.9 (±14) | 94.2 (±0.08) | 0.188 (±0.14) | 0.960 (±0.12) | 26.3K (±1.32K) | 363.8 (±18)
GAPA (ours) | 94.0 (±0.06) | 0.230 (±0.12) | 0.931 (±0.10) | 8.03 (±0.40) | 2.85 (±0.14) | 94.4 (±0.06) | 0.230 (±0.12) | 0.953 (±0.10) | 10.29 (±0.51) | 3.30 (±0.17)

M.3. Inducing Point Selection: KMeans vs. Farthest-Point Sampling

We compare two strategies for selecting inducing points: the farthest-point sampling (FPS) method used in the main paper, and the KMeans-based option introduced in Appendix G.1. Figures 10–11 report results for MNIST and FMNIST across a range of inducing-point budgets M.
Overall, both methods exhibit similar behaviour: performance improves monotonically with M and saturates once sufficient coverage of the activation space is achieved. KMeans, however, provides a more efficient trade-off between coverage and inducing-point count, reaching its plateau at substantially smaller M values than FPS. This makes KMeans a practical alternative when memory, storage, or index construction time is a constraint.

M.4. Random vs. Farthest-Point Sampling

Table 13 reveals that farthest-point sampling (FPS) and random sampling exhibit different strengths. At smaller M (5K–10K), random sampling achieves better NLL and OOD detection, likely because FPS's greedy selection may overfit to specific activation patterns. However, as M increases to 40K, FPS shows marginal improvements, suggesting its structured coverage becomes beneficial with sufficient inducing points. The convergence of both methods at large M indicates that with enough inducing points, the activation space is well covered regardless of sampling strategy.

Figure 9. NLL (left) and OOD BALD (right) with GAPA at different layer placements, for MNIST and FMNIST (M = 55,000).

Table 11. Comparison of metrics at different GAPA layer placements (M = 55,000). Best values are bold. Lower is better (↓) for NLL; higher is better (↑) for OOD-AUC/BALD.

GAPA layers | MNIST NLL ↓ | MNIST OOD-AUC ↑ | MNIST OOD BALD ↑ | FMNIST NLL ↓ | FMNIST OOD-AUC ↑ | FMNIST OOD BALD ↑
[1] | 0.072 | 0.915 | 0.916 | 0.326 | 0.870 | 0.900
[2] | 0.070 | 0.921 | 0.920 | 0.321 | 0.884 | 0.907
[3] | 0.068 | 0.933 | 0.933 | 0.309 | 0.901 | 0.911
[4] | 0.117 | 0.951 | 0.957 | 0.353 | 0.973 | 0.969
[1, 2] | 0.069 | 0.923 | 0.922 | 0.318 | 0.901 | 0.921
[1, 3] | 0.069 | 0.934 | 0.934 | 0.309 | 0.912 | 0.924
[1, 4] | 0.120 | 0.953 | 0.961 | 0.357 | 0.973 | 0.969
[2, 3] | 0.070 | 0.935 | 0.935 | 0.310 | 0.917 | 0.928
[2, 4] | 0.122 | 0.953 | 0.961 | 0.360 | 0.973 | 0.968
[3, 4] | 0.134 | 0.953 | 0.961 | 0.372 | 0.973 | 0.966
[1, 2, 3] | 0.072 | 0.936 | 0.935 | 0.312 | 0.924 | 0.934
[1, 2, 4] | 0.125 | 0.953 | 0.960 | 0.364 | 0.973 | 0.968
[1, 3, 4] | 0.137 | 0.953 | 0.960 | 0.376 | 0.973 | 0.966
[2, 3, 4] | 0.139 | 0.953 | 0.960 | 0.380 | 0.973 | 0.966
[1, 2, 3, 4] | 0.142 | 0.953 | 0.960 | 0.384 | 0.974 | 0.966

M.5. KNN Sweep: K = 1 to 500

To evaluate the robustness of the KNN approximation used in GAPA, we performed a comprehensive sweep over K = {1, 2, 3, 5, 10, 20, 50, 100, 150, 200, 300, 400, 500} on both MNIST and FMNIST. For each K, we recomputed the GP posterior variance using the K nearest cached activations and measured all uncertainty metrics (NLL, ECE, OOD-AUC, OOD-BALD) as well as test-time inference cost. Across all metrics and datasets, the results reveal a strikingly consistent pattern: all curves improve smoothly and monotonically with K, and we observed no instability, even at K = 1.

Negative Log-Likelihood (NLL). NLL decreases continuously as K increases for both datasets. MNIST improves from ≈ 0.092 at K = 1 to ≈ 0.081 at K = 500. FMNIST improves from ≈ 0.408 at K = 1 to ≈ 0.390 at K = 500.

Expected Calibration Error (ECE). ECE improves monotonically for both datasets. MNIST decreases from ≈ 0.062 to ≈ 0.015. FMNIST shows a similar smooth trend.

OOD-AUC. OOD detection improves slightly with K. MNIST increases from 0.950 (K = 1) to 0.963 (K = 500). FMNIST improves up to K ≈ 50, then plateaus or slightly degrades for very large K due to over-smoothing.

Table 12. Metrics across different M values for MNIST and FMNIST, GAPA at the 4th layer.

M | MNIST NLL ↓ | MNIST OOD ↑ | MNIST BALD ↑ | MNIST setup/s ↓ | MNIST inference/s ↓ | FMNIST NLL ↓ | FMNIST OOD ↑ | FMNIST BALD ↑ | FMNIST setup/s ↓ | FMNIST inference/s ↓
10 | 0.248 | 0.897 | 0.919 | 2.733 | 7.517 | 0.489 | 0.957 | 0.936 | 0.257 | 7.584
100 | 0.248 | 0.897 | 0.919 | 185.477 | 7.478 | 0.489 | 0.957 | 0.936 | 181.340 | 7.625
1000 | 0.246 | 0.898 | 0.920 | 184.787 | 7.674 | 0.486 | 0.957 | 0.937 | 183.503 | 7.763
5000 | 0.219 | 0.913 | 0.934 | 195.889 | 8.663 | 0.470 | 0.960 | 0.943 | 194.468 | 8.702
10000 | 0.181 | 0.933 | 0.950 | 212.990 | 10.119 | 0.442 | 0.964 | 0.952 | 211.333 | 9.873
20000 | 0.139 | 0.947 | 0.958 | 247.684 | 12.498 | 0.390 | 0.970 | 0.964 | 241.000 | 12.164
40000 | 0.119 | 0.953 | 0.961 | 301.511 | 16.926 | 0.355 | 0.972 | 0.968 | 301.086 | 16.826
55000 | 0.117 | 0.953 | 0.961 | 455.735 | 20.445 | 0.353 | 0.973 | 0.969 | 384.825 | 20.527

Figure 10. FMNIST: NLL (left) and OOD-AUC (right) for KMeans vs. FPS across M.

OOD BALD. Epistemic sensitivity improves steadily for both datasets, with consistent behaviour across the entire sweep.

Test-time cost. Test time increases roughly linearly with K for both datasets. For MNIST, inference grows from ≈ 2.1 ms to ≈ 16 ms. FMNIST follows the same scaling pattern.

Takeaway. These experiments show:

• 1-NN is already stable and competitive, especially for OOD detection.
• Increasing K to 20–50 provides clear gains in calibration and NLL.
• Very large K has diminishing returns and incurs high compute cost.

Overall, the full sweep confirms that the KNN approximation is robust, stable, and effective, and that GAPA behaves predictably across the entire KNN range.

N. Extended Related Work

Uncertainty quantification (UQ) methods in deep learning differ primarily in where uncertainty is placed (e.g., weights, outputs, or representations) and in the resulting training and test-time costs. We focus on the post-hoc regime for frozen pretrained backbones, where retraining, multiple test-time samples, or full-network second-order computation can be infeasible.

Figure 11. Setup time (FAISS indexing) for KMeans vs. FPS on FMNIST (left) and MNIST (right).

Figure 12. MNIST: NLL (left) and OOD-AUC (right) for KMeans vs. FPS across M.

Weight-space Bayesianization and Laplace. Weight-space Bayesian methods model uncertainty directly over parameters, typically requiring either approximate inference during training or posterior sampling at test time.
Laplace approximations instead fit a local Gaussian posterior around a trained solution using curvature information, and recent work has revisited Laplace as a competitive and practical Bayesian deep learning baseline, including analyses of when structured approximations are necessary for scalability and fidelity (Blundell et al., 2015; MacKay, 1992; Daxberger et al., 2021; Ortega et al., 2023; Deng et al., 2022). Closely related, stochastic weight-space approximations such as SWAG provide inexpensive posterior samples from SGD trajectories and often serve as strong uncertainty baselines without changing the underlying architecture (Maddox et al., 2019).

Last-layer and single-pass post-hoc methods. A popular compromise Bayesianizes only the final layer(s) while keeping the feature extractor frozen, yielding post-hoc uncertainty at substantially reduced cost. Beyond Laplace-style last-layer treatments, deterministic variational formulations such as variational Bayesian last layers (VBLL) aim to deliver single-pass predictive uncertainty for frozen-backbone models (Harrison et al., 2024). These approaches are strong deployment-friendly baselines, but they concentrate Bayesian modeling at the head and do not in general propagate mean-preserving uncertainty through the intermediate computations of the frozen network.

Sampling-based baselines. Deep ensembles approximate epistemic uncertainty through multiple independently trained models (Lakshminarayanan et al., 2017), while MC Dropout relies on multiple stochastic forward passes at inference (Gal and Ghahramani, 2016). These are effective but often incompatible with strict single-pass deployment constraints.
Figure 13. BALD-based OOD detection for MNIST (left) and FMNIST (right).

Table 13. Comparison of NLL and OOD BALD for FPS and three random baselines (FMNIST, gapa_index = last layer).

M | FPS NLL ↓ | FPS OOD ↑ | Rand1 NLL ↓ | Rand1 OOD ↑ | Rand2 NLL ↓ | Rand2 OOD ↑ | Rand3 NLL ↓ | Rand3 OOD ↑
5000 | 0.470 | 0.943 | 0.394 | 0.957 | 0.394 | 0.957 | 0.394 | 0.957
10000 | 0.442 | 0.952 | 0.380 | 0.960 | 0.380 | 0.960 | 0.380 | 0.960
20000 | 0.390 | 0.964 | 0.369 | 0.964 | 0.369 | 0.964 | 0.369 | 0.964
40000 | 0.355 | 0.968 | 0.359 | 0.967 | 0.359 | 0.967 | 0.359 | 0.967

Representation-based and calibration baselines. Distance-/density-aware approaches estimate uncertainty from representations, e.g., spectral-normalized GP-style heads (SNGP) (Liu et al., 2020) or density-based uncertainty on deep features (DDU) (Mukhoti et al., 2023). Calibration-only post-processing such as temperature scaling can improve confidence calibration but does not model epistemic uncertainty (Guo et al., 2017).

Gaussian processes and function-space views. Gaussian processes (GPs) provide a classical function-space approach to uncertainty and connect naturally to infinite-width neural networks: fully-connected nets converge to an NNGP prior (Lee et al., 2017), and their infinite-width training dynamics are characterized by the NTK (Jacot et al., 2018).
For scalability, GP/kernel inference is commonly accelerated via low-rank approximations such as inducing points (Titsias, 2009) and Nyström methods (Williams and Seeger, 2000), while local GP approximations and noisy-input corrections (e.g., NIGP) provide efficient variance adjustments under locality or input uncertainty (Gramacy and Apley, 2015; McHutchon and Rasmussen, 2011). GAPA builds on these function-space ideas but applies them inside frozen networks by placing uncertainty in activation space, preserving the pretrained mean and enabling deterministic single-pass inference.

Summary. Overall, existing approaches trade off between (i) retraining or multi-sample inference, (ii) post-hoc last-layer approximations that concentrate uncertainty at the head, or (iii) representation-based proxies. These gaps motivate mean-preserving, post-hoc activation-space uncertainty with deterministic single-pass inference.

Figure 14. MNIST NLL vs. K.
Figure 15. FMNIST NLL vs. K.
Figure 16. MNIST NLL vs. K.
Figure 17. FMNIST NLL vs. K.
Figure 18. MNIST ECE vs. K.
Figure 19. FMNIST ECE vs. K.
Figure 20. MNIST OOD-AUC vs. K.
Figure 21. FMNIST OOD-AUC vs. K.
Figure 22. MNIST OOD-BALD vs. K.
Figure 23. FMNIST OOD-BALD vs. K.
Figure 24. MNIST test time vs. K.
Figure 25. FMNIST test time vs. K.
