No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions
Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.
💡 Research Summary
The paper tackles the problem of evaluating methods that attribute predictive uncertainty to input features – a task often referred to as “uncertainty attribution” within the emerging field of Explainable Uncertainty Quantification (XUQ). While many recent works propose novel ways to generate such attributions, the evaluation landscape is fragmented: researchers rely on heterogeneous proxy tasks and disparate metrics, making systematic comparison and reproducibility difficult. To address this, the authors adapt the well‑established Co‑12 framework from Explainable AI (XAI) to the specific needs of uncertainty explanations.
Four of the original Co‑12 properties—correctness, consistency, continuity, and compactness—are concretely instantiated for uncertainty attributions. Because uncertainty explanations have unique characteristics, the authors introduce a fifth property, “conveyance,” which measures whether controlled increases in epistemic uncertainty (e.g., by raising dropout rates) are reliably reflected in the feature‑level attribution scores. For each property, a quantitative metric is defined: correctness is measured by the reduction in predictive variance after removing high‑attribution features; consistency is assessed via rank‑correlation (Kendall’s τ) between attribution rankings produced by different stochastic runs or different uncertainty estimators; continuity uses a Lipschitz‑type metric to capture smoothness of attribution changes under small input perturbations; compactness evaluates how much of the total attribution mass is concentrated in a small subset of features; conveyance is quantified by the Pearson correlation between the magnitude of induced epistemic uncertainty and the resulting change in attribution scores.
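The metric definitions above can be sketched in plain Python. These are simplified stand-ins, not the paper's exact formulations: the τ variant (a naive tau-a without tie correction), the L2 norms in the continuity ratio, and the top-k cutoff in compactness are all illustrative assumptions.

```python
import itertools

def kendall_tau(a, b):
    """Consistency: rank correlation between two attribution vectors
    (naive O(n^2) tau-a; tied pairs count as neither concordant nor discordant)."""
    pairs = list(itertools.combinations(range(len(a)), 2))
    concordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    discordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (concordant - discordant) / len(pairs)

def lipschitz_ratio(attr_x, attr_xp, x, xp):
    """Continuity: attribution change per unit input change (L2 norms).
    Lower values mean smoother attributions under small perturbations."""
    d_attr = sum((a - b) ** 2 for a, b in zip(attr_x, attr_xp)) ** 0.5
    d_in = sum((u - v) ** 2 for u, v in zip(x, xp)) ** 0.5
    return d_attr / d_in if d_in else float("inf")

def compactness(attr, k):
    """Compactness: fraction of total absolute attribution mass carried
    by the k highest-magnitude features."""
    mags = sorted((abs(v) for v in attr), reverse=True)
    total = sum(mags)
    return sum(mags[:k]) / total if total else 0.0

def pearson(xs, ys):
    """Conveyance: linear correlation between induced epistemic-uncertainty
    levels and the resulting changes in attribution scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

In the paper's setting, `kendall_tau` would compare attribution rankings across stochastic runs or estimators, and `pearson` would correlate controlled uncertainty increases (e.g., higher dropout rates) with attribution changes.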
The experimental setup combines two benchmark datasets—UCI Wine Quality (tabular) and MNIST (image)—with three uncertainty estimation techniques: Monte‑Carlo Dropout (MCD), Monte‑Carlo DropConnect (MCDC), and Deep Ensembles. For the underlying feature attribution component, the authors employ both gradient‑based methods (Layer‑wise Relevance Propagation, Integrated Gradients) and perturbation‑based methods (feature flipping, pixel blurring). This yields a matrix of method combinations, each evaluated across the eight metrics that instantiate the five properties (the four adapted Co‑12 properties plus conveyance, with some properties covered by more than one metric). Sanity checks (randomization of model parameters, label shuffling) are performed to verify metric reliability.
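To illustrate how the stochastic estimators in this setup produce an epistemic-uncertainty signal, here is a minimal Monte‑Carlo dropout sketch on a toy linear model. The model, dropout rate, and sample count are illustrative assumptions; in the paper's experiments dropout is applied inside trained networks, and MC DropConnect would mask weights rather than inputs.

```python
import random

def forward(x, weights, drop_p, rng):
    """One stochastic forward pass of a linear model with dropout applied
    to the input features (a toy stand-in for a neural network layer)."""
    out = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() >= drop_p:           # keep feature with prob 1 - p
            out += xi * wi / (1.0 - drop_p)  # inverted-dropout scaling
    return out

def mc_uncertainty(x, weights, drop_p=0.2, n_samples=200, seed=0):
    """Mean and predictive variance over stochastic forward passes; the
    variance is the epistemic-uncertainty signal to be attributed."""
    rng = random.Random(seed)
    preds = [forward(x, weights, drop_p, rng) for _ in range(n_samples)]
    mean = sum(preds) / n_samples
    var = sum((p - mean) ** 2 for p in preds) / n_samples
    return mean, var
```

Raising `drop_p` is exactly the kind of controlled uncertainty increase the conveyance property checks for: the predictive variance grows, and a good attribution method should reflect that growth at the feature level.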
Key findings are as follows: (1) Gradient‑based attribution methods consistently outperform perturbation‑based approaches on consistency and conveyance, indicating that direct use of model gradients captures uncertainty variations more faithfully than post‑hoc perturbations. (2) MCDC generally scores higher than MCD across most metrics, especially in compactness and conveyance, suggesting that stochastic weight masking yields richer epistemic uncertainty signals than stochastic activations. (3) Deep Ensembles achieve the highest correctness scores (largest variance reduction when high‑attribution features are removed) but are computationally expensive, limiting practical deployment. (4) Correlations among the eight metrics are low (e.g., correctness vs. conveyance τ≈0.2), confirming that no single metric can serve as a universal proxy for attribution quality. Consequently, a multi‑dimensional evaluation is essential.
The authors acknowledge several limitations. Their framework focuses on epistemic uncertainty and does not directly address aleatoric uncertainty, which may require different evaluation criteria. The proposed metrics rely on linear, conservation‑preserving attribution methods; non‑linear or sampling‑based explainers such as LIME and SHAP cannot be evaluated within this setup. Moreover, the study lacks human‑grounded evaluation, so the relationship between the proposed metrics and actual user trust or decision‑making effectiveness remains untested.
In conclusion, the paper makes a substantial contribution by providing a systematic, functionally‑grounded evaluation suite for uncertainty attribution methods. By aligning with the Co‑12 properties, introducing the novel conveyance metric, and empirically demonstrating the framework across diverse models and datasets, the work establishes a solid baseline for future research in XUQ. It highlights the necessity of multi‑metric assessment and opens avenues for extending the framework to aleatoric uncertainty, non‑linear explainers, and user‑centric studies.