Confidence Calibration under Ambiguous Ground Truth

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model’s own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.


💡 Research Summary

The paper tackles a fundamental flaw in modern confidence calibration methods when applied to tasks where the ground‑truth label is inherently ambiguous. Conventional post‑hoc calibrators such as Temperature Scaling (TS), Platt scaling, or Dirichlet calibration are trained on a single “voted” label per instance, assuming a unique correct class. In many real‑world domains—medical imaging, natural‑language inference, low‑resolution vision—human annotators legitimately disagree, and the true target is a probability distribution over classes (the annotator distribution π(·|x)).

The authors first formalize “true‑label calibration”: a model is truly calibrated if, for any confidence level p, the probability that a randomly drawn annotator label matches the model’s top‑class prediction equals p. They show that calibrating against the voted label optimizes the wrong objective, leading to systematic over‑confidence with respect to the actual annotator distribution. Theoretical analysis yields two key propositions: (1) TS is biased toward a lower temperature than would be optimal for a soft‑label loss, because the voted‑label loss pushes the model to assign probability 1 to the majority class; (2) the miscalibration gap grows monotonically with annotation entropy, i.e., the more disagreement among annotators, the larger the error. Empirical evidence on a synthetic 2‑D Gaussian dataset and on four real benchmarks (CIFAR‑10H, ChaosNLI, ISIC 2019, DermaMNIST) confirms that while TS dramatically reduces ECE measured against voted labels, it actually worsens ECE measured against the true annotator distribution, especially on ambiguous clusters.
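The bias described in Proposition (1) can be illustrated with a small self-contained experiment (not from the paper; the annotator distributions, the logit-sharpening factor of 3, and the noise level are all assumptions for illustration). We build synthetic annotator distributions π(·|x), construct over-sharp model logits, and fit a temperature by minimizing cross-entropy once against the majority-voted one-hot labels and once against the soft targets π. The voted-label fit prefers a markedly lower temperature, i.e. it drives confidence toward 1 on the majority class and underestimates annotator uncertainty:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, k = 2000, 3

# Synthetic annotator distributions pi(.|x): ambiguous two-class mixtures
# where the majority class holds between 55% and 95% of the mass.
alpha = rng.uniform(0.55, 0.95, size=n)
pi = np.zeros((n, k))
pi[np.arange(n), 0] = alpha
pi[np.arange(n), 1] = 1 - alpha

# Overconfident model: logits are a sharpened (x3) version of pi plus noise.
logits = 3.0 * np.log(pi + 1e-8) + rng.normal(0.0, 0.3, size=(n, k))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T, targets):
    """Cross-entropy of temperature-scaled probabilities vs (soft) targets."""
    p = softmax(logits / T)
    return -np.mean(np.sum(targets * np.log(p + 1e-12), axis=1))

voted = np.eye(k)[pi.argmax(axis=1)]  # majority-voted one-hot labels

T_hard = minimize_scalar(nll, args=(voted,), bounds=(0.05, 20), method="bounded").x
T_soft = minimize_scalar(nll, args=(pi,), bounds=(0.05, 20), method="bounded").x

# Proposition (1) in miniature: the voted-label objective selects a lower
# temperature than the soft-label objective, so the calibrated model stays
# overconfident relative to the annotator distribution.
print(f"T fitted on voted labels: {T_hard:.2f}")
print(f"T fitted on soft labels:  {T_soft:.2f}")
```

On this construction the soft-label fit recovers a temperature near the sharpening factor, while the voted-label fit settles well below it, mirroring the paper's claim that the voted-label loss pushes probability mass onto the majority class.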

To remedy this, the paper introduces a family of ambiguity‑aware post‑hoc calibrators that directly target the annotator distribution without retraining the underlying model:

  1. Dirichlet‑Soft – when the full per‑instance annotator distribution is available, a learnable Dirichlet transformation is applied to the logits, and cross‑entropy is minimized against the soft targets π̂(x). This method achieves the best calibration across all settings.

  2. Monte Carlo Temperature Scaling (MCTS) – assumes only individual annotations are accessible. By sampling a single annotation per example (S = 1) and averaging the resulting temperature estimates, MCTS matches the performance of full‑distribution calibration, demonstrating that pre‑aggregated soft labels are unnecessary.

  3. Label‑Smooth Temperature Scaling (LS‑TS) – operates with only the majority‑voted label. It constructs pseudo‑soft targets by smoothing the model’s own confidence scores (a data‑driven label‑smoothing scheme) and then fits a temperature parameter. LS‑TS provides a substantial calibration gain even when no annotator data are present.
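The key property behind MCTS can be checked directly: the expectation of the cross-entropy over one-hot labels sampled from π equals the cross-entropy against π itself, so fitting a temperature on a single sampled annotation per example (S = 1) recovers the soft-label fit. The sketch below (a toy construction, not the paper's code; the distributions and the sharpening factor 2.5 are assumptions) compares the two fits:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, k = 4000, 3

# Toy annotator distributions: majority mass alpha, remainder split 70/30.
alpha = rng.uniform(0.5, 0.9, size=n)
pi = np.stack([alpha, (1 - alpha) * 0.7, (1 - alpha) * 0.3], axis=1)

# Over-sharp model logits (sharpening factor 2.5, so the ideal T is 2.5).
logits = 2.5 * np.log(pi)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fit_T(targets):
    """Fit a single temperature by minimizing cross-entropy vs targets."""
    nll = lambda T: -np.mean(
        np.sum(targets * np.log(softmax(logits / T) + 1e-12), axis=1)
    )
    return minimize_scalar(nll, bounds=(0.05, 20), method="bounded").x

# Soft-target temperature scaling: full annotator distribution available.
T_soft = fit_T(pi)

# MCTS with S = 1: draw one annotation per example, fit on the one-hot sample.
# In expectation this is the soft-label objective, so no aggregation is needed.
samples = np.array([rng.choice(k, p=p) for p in pi])
T_mcts = fit_T(np.eye(k)[samples])

print(f"T_soft = {T_soft:.2f}, T_mcts(S=1) = {T_mcts:.2f}")
```

With a few thousand calibration examples the sampled-label fit lands close to the soft-label fit, which is the intuition behind the paper's finding that pre-aggregated label distributions are unnecessary.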

All three methods are post‑hoc: they only require the stored logits of a pre‑trained classifier and a calibration set. Evaluation uses proper scoring rules (Brier score, negative log‑likelihood) and a Monte‑Carlo estimate of Expected Calibration Error (ECE) with respect to the true annotator distribution (ECE_true). Results show that Dirichlet‑Soft reduces ECE_true by 55–87% relative to standard TS, while LS‑TS achieves 9–77% reduction without any annotator information. Notably, even Dirichlet calibration with hard (voted) targets underperforms TS, confirming that the mis‑specified objective, not model capacity, is the root cause.
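A binned estimator in the spirit of ECE_true can be sketched as follows (an illustrative implementation, not the paper's; the function name, binning scheme, and exact-expectation shortcut are assumptions). Instead of checking the top prediction against one voted label, each bin's "accuracy" is the probability that a label drawn from π(·|x) matches the prediction; since that probability is just π evaluated at the predicted class, the Monte Carlo draws can be replaced by their exact expectation:

```python
import numpy as np

def ece_true(confidences, pred, pi, n_bins=15):
    """Expected Calibration Error against the annotator distribution.

    confidences: (n,) top-class confidence per example
    pred:        (n,) predicted class index per example
    pi:          (n, k) annotator distribution pi(.|x) per example
    """
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        # P(sampled annotator label == prediction) is pi[i, pred[i]];
        # averaging it gives the exact expectation of Monte Carlo draws.
        hit_rate = pi[mask, pred[mask]].mean()
        total += mask.mean() * abs(confidences[mask].mean() - hit_rate)
    return total

# An ambiguous two-class example: the model predicts class 0 everywhere.
pi = np.array([[0.7, 0.3], [0.6, 0.4]])
pred = np.array([0, 0])

# Confidence matching pi exactly is truly calibrated; confidence 0.99 is
# well-calibrated against voted labels but badly off against pi.
print(ece_true(pi[np.arange(2), pred], pred, pi))   # ~0.0
print(ece_true(np.array([0.99, 0.99]), pred, pi))   # large gap
```

This makes the paper's headline failure mode concrete: a model that always predicts the majority class with near-certain confidence has zero voted-label error here, yet a large ECE_true.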

The study concludes that in domains with genuine label ambiguity, calibration must be reframed to predict the full annotator distribution rather than a single majority label. The proposed methods offer a practical toolkit that adapts to the amount of annotation data available, paving the way for more reliable uncertainty estimates in high‑stakes applications.

