LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids the guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform applied identically to all samples. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction-set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant that uses calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods at far lower compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
💡 Research Summary
Medical vision‑language models (VLMs) such as CLIP and its domain‑specialized variants have demonstrated impressive zero‑shot classification abilities across a range of imaging modalities. However, in safety‑critical medical settings, raw confidence scores are insufficient; clinicians need calibrated uncertainty with finite‑sample guarantees, especially under domain shift, class imbalance, and limited calibration data. Split conformal prediction (SCP) offers marginal coverage guarantees by calibrating a non‑conformity score on a held‑out labeled set, but standard scores (LAC, APS, RAPS) often produce overly large prediction sets and exhibit a high class‑conditioned coverage gap (CCV). Moreover, naïvely fine‑tuning a VLM on calibration labels breaks the exchangeability assumption required for SCP, invalidating its guarantees.
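The SCP recipe described above is short enough to state concretely. The following is a minimal NumPy sketch (not code from the paper) of split conformal calibration with the LAC score s(x, y) = 1 − p_y(x): compute scores on a held-out labeled set, take the finite-sample-corrected quantile, and include in each test set every label whose score falls below that threshold.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration with the LAC score s = 1 - p_y.

    cal_probs: (n, K) predicted class probabilities on the calibration set.
    cal_labels: (n,) integer labels.
    Returns a score threshold giving >= 1 - alpha marginal coverage
    under exchangeability.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # non-conformity scores
    # finite-sample corrected quantile level ceil((n+1)(1-alpha)) / n
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, tau):
    """Include every label whose LAC score is at most the threshold tau."""
    return [np.where(1.0 - p <= tau)[0] for p in test_probs]
```

The same machinery applies unchanged when the base probabilities are replaced by refined ones, which is what makes a deterministic, label-free refinement compatible with SCP.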
The authors introduce LATA (Laplacian‑Assisted Transductive Adaptation), a label‑free, training‑free transductive refinement that operates on the joint calibration‑test pool. First, a sparse k‑nearest‑neighbor graph is built on ℓ₂‑normalized image embeddings, yielding a symmetric affinity matrix W_g. The zero‑shot probability vectors q(x) from the frozen VLM serve as initial beliefs. LATA then solves a regularized objective that balances fidelity to q(x) (a KL‑divergence term) against smoothness over the graph (a quadratic Laplacian term). By decomposing the objective into a convex part (KL plus a diagonal ℓ₂ term) and a concave part (the pairwise interaction), the authors apply the Concave‑Convex Procedure (CCCP). The resulting mean‑field update is multiplicative, z̃_i ∝ q_i · exp(γ ∑_j W_g,ij z̃_j), followed by row normalization. A handful of iterations (5–10) suffices, and because the same deterministic transform is applied to both calibration and test samples, exchangeability is preserved and SCP validity remains intact.
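The graph construction and multiplicative update can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions, not the authors' code: the function name `lata_refine`, the cosine-similarity affinity, and the default hyperparameters are choices made here for demonstration.

```python
import numpy as np

def lata_refine(q, emb, k=10, gamma=1.0, n_iters=8):
    """Illustrative sketch of the CCCP mean-field refinement.

    q:   (n, K) zero-shot probabilities over the pooled calibration+test samples.
    emb: (n, d) image embeddings.
    """
    # l2-normalize embeddings and build a symmetric k-NN affinity matrix
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, -np.inf)          # no self-edges
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # k nearest neighbors per node
    W = np.zeros_like(sim)
    rows = np.repeat(np.arange(len(e)), k)
    W[rows, nbrs.ravel()] = np.clip(sim[rows, nbrs.ravel()], 0.0, None)
    W = np.maximum(W, W.T)                  # symmetrize

    # multiplicative mean-field updates: z_i ∝ q_i * exp(gamma * sum_j W_ij z_j)
    z = q.copy()
    for _ in range(n_iters):
        z = q * np.exp(gamma * (W @ z))
        z /= z.sum(axis=1, keepdims=True)   # row-normalize back to a simplex
    return z
```

Because the update is a fixed deterministic map of the pooled inputs, applying it before conformal calibration does not disturb exchangeability between calibration and test points.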
To further improve set efficiency, the paper incorporates a failure‑aware conformal score built on the Vision‑Language Uncertainty (ViLU) module. ViLU, pre‑trained on an auxiliary dataset, outputs a per‑image failure probability u(x) and a label‑attention vector α(x). The new non‑conformity score is S★(x, y) = S_base(z̃(x), y) · (1 + λ u(x)) − η α_y(x), where S_base can be LAC, APS, or RAPS. The λ u(x) term inflates scores for hard inputs, protecting coverage, while the η α_y(x) term discounts labels that the multimodal attention deems plausible, shrinking sets for easy cases. This scoring is applied symmetrically to calibration and test data, again preserving exchangeability.
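The score itself is a simple elementwise transform, sketched below in NumPy. The function name and hyperparameter defaults are illustrative; in the paper's pipeline, u(x) and α(x) would come from a pre-trained ViLU head rather than being passed in directly.

```python
import numpy as np

def failure_aware_score(base_scores, u, attn, lam=1.0, eta=0.1):
    """Sketch of S*(x, y) = S_base(x, y) * (1 + lam * u(x)) - eta * attn_y(x).

    base_scores: (n, K) base non-conformity scores (e.g. LAC: 1 - p_y).
    u:           (n,)   per-image failure probability from the ViLU head.
    attn:        (n, K) label-attention weights over the K class names.
    """
    # inflate scores for hard inputs; discount labels the attention finds plausible
    return base_scores * (1.0 + lam * u[:, None]) - eta * attn
```

The two effects are monotone by construction: increasing u(x) raises every score for that image (larger sets, protecting coverage on hard inputs), while a larger α_y(x) lowers the score of label y (smaller sets on easy inputs).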
An optional prior bias can be injected using calibration label marginals m. By adjusting the zero‑shot logits with β log m (β∈