DoubleTake: Contrastive Reasoning for Faithful Decision-Making in Medical Imaging

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Accurate decision making in medical imaging requires reasoning over subtle visual differences between confusable conditions, yet most existing approaches rely on nearest neighbor retrieval that returns redundant evidence and reinforces a single hypothesis. We introduce a contrastive, document-aware reference selection framework that constructs compact evidence sets optimized for discrimination rather than similarity by explicitly balancing visual relevance, embedding diversity, and source-level provenance using ROCO embeddings and metadata. While ROCO provides large-scale image-caption pairs, it does not specify how references should be selected for contrastive reasoning, and naive retrieval frequently yields near-duplicate figures from the same document. To address this gap, we release a reproducible reference selection protocol and curated reference bank that enable a systematic study of contrastive retrieval in medical image reasoning. Building on these contrastive evidence sets, we propose Counterfactual-Contrastive Inference, a confidence-aware reasoning framework that performs structured pairwise visual comparisons and aggregates evidence using margin-based decision rules with faithful abstention. On the MediConfusion benchmark, our approach achieves state-of-the-art performance, improving set-level accuracy by nearly 15% relative to prior methods while reducing confusion and improving individual accuracy.


💡 Research Summary

The paper tackles a fundamental shortcoming of current medical‑image decision‑support systems: they excel at recognizing the presence of a finding but often fail when two visually similar conditions must be distinguished. Existing approaches typically augment inference with nearest‑neighbor retrieval from large vision‑language corpora such as ROCO. While this can provide contextual examples, it is optimized for similarity, not discrimination. Consequently, retrieved evidence is frequently redundant—multiple figures from the same article that depict the same pathology—reinforcing a single hypothesis and obscuring subtle diagnostic cues.

To address this, the authors propose a two‑stage framework that (1) constructs a contrastive, document‑aware evidence set for each query and (2) performs confidence‑aware aggregation of structured pairwise comparisons, called Counterfactual‑Contrastive Inference (CCI).

Stage 1 – Contrastive, Document‑Aware Reference Selection
The authors first build a reference bank from ROCO, storing for each image its CLIP‑ViT‑B/32 embedding, caption, document identifier, and imaging modality. For a query image x, they compute cosine similarity s(x, r) with every bank entry, then apply a near‑duplicate filter (removing any candidate r if its similarity to a previously retained image exceeds τ_dup = 0.99). The remaining candidates form a deduplicated pool C(x).
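This retrieval-and-deduplication step can be sketched minimally as follows, assuming L2-normalized embeddings (so cosine similarity reduces to a dot product); the function and variable names are illustrative, not from the paper:

```python
import numpy as np

TAU_DUP = 0.99  # near-duplicate threshold from the paper

def dedup_pool(query_emb, bank_embs):
    """Rank bank entries by cosine similarity to the query, then greedily
    drop any candidate whose similarity to an already-kept image exceeds
    TAU_DUP. Embeddings are assumed L2-normalized, so cosine similarity
    is a plain dot product."""
    sims = bank_embs @ query_emb          # cosine similarity to the query
    order = np.argsort(-sims)             # most similar first
    kept = []
    for idx in order:
        cand = bank_embs[idx]
        # keep the candidate only if it is not a near-duplicate of a kept image
        if all(cand @ bank_embs[k] <= TAU_DUP for k in kept):
            kept.append(idx)
    return kept, sims
```

The returned `kept` list, ordered by similarity, plays the role of the deduplicated pool C(x) from which the triad is drawn.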

From C(x) they select exactly three references (a “triad”) using rank‑band heuristics:

  • Anchor (r₁) – the most similar non‑duplicate candidate. This provides a stable visual baseline that shares modality and anatomy with the query.
  • Hard Negative (r₂) – drawn from the mid‑similarity band (ranks 20‑200). Among these, the image that maximizes the L1 distance to the anchor in normalized embedding space is chosen. This forces the model to compare the query against an image that is still relevant but differs along subtle dimensions.
  • Boundary Probe (r₃) – taken from a broader band (ranks 200‑1000). Each candidate receives a score that combines (i) lexical overlap κ between its caption and the query’s question text, (ii) its similarity to the query, and (iii) its dissimilarity to the anchor. The candidate with the highest score becomes r₃.

The selection process explicitly enforces (when possible) distinct document IDs and modality consistency (e.g., “CT”, “MRI”). If a constraint cannot be satisfied within a band, the algorithm relaxes it gradually, ensuring a valid triad is always produced. This design yields an evidence set that maximizes relevance, penalizes redundancy, and respects provenance, thereby aligning retrieval with the discriminative nature of confusion‑style benchmarks.
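The triad selection with rank bands and graceful constraint relaxation might be sketched as below. The band edges follow the paper, but the scoring combination for the boundary probe (an unweighted sum of lexical overlap, query similarity, and anchor dissimilarity) and all helper names are assumptions for illustration:

```python
import numpy as np

def select_triad(ranked, embs, sims, doc_ids, captions, question_tokens):
    """Sketch of rank-band triad selection over the deduplicated pool.
    `ranked` lists candidate indices, most similar first; `embs` holds
    normalized embeddings; `sims` holds query similarities per index."""
    def pick(band, score_fn, used_docs):
        # prefer candidates from documents not yet used; relax the
        # constraint (fall back to the whole band) if it cannot be met
        strict = [i for i in band if doc_ids[i] not in used_docs]
        pool = strict or list(band)
        return max(pool, key=score_fn)

    anchor = ranked[0]                       # r1: most similar non-duplicate
    used = {doc_ids[anchor]}

    mid_band = ranked[20:200]                # r2 band (ranks 20-200)
    l1_to_anchor = lambda i: np.abs(embs[i] - embs[anchor]).sum()
    hard_neg = pick(mid_band, l1_to_anchor, used)
    used.add(doc_ids[hard_neg])

    far_band = ranked[200:1000]              # r3 band (ranks 200-1000)
    def probe_score(i):
        # lexical overlap kappa + similarity to query - similarity to anchor
        kappa = len(set(captions[i].lower().split()) & question_tokens)
        return kappa + sims[i] - float(embs[i] @ embs[anchor])
    probe = pick(far_band, probe_score, used)
    return anchor, hard_neg, probe
```

A real implementation would also enforce the modality-consistency check and relax it band by band, as the paper describes; the sketch only shows the document-ID constraint.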

Stage 2 – Counterfactual‑Contrastive Inference (CCI)
Given the triad R(x) = {r₁, r₂, r₃}, a vision‑language model f_θ processes each (query, reference) pair using a fixed prompt that asks the model to (a) choose between the two answer options A or B and (b) report a self‑confidence α_i ∈ [0, 1]. The three confidence‑weighted votes are then aggregated with a margin‑based decision rule: the final answer is the weighted majority option, unless the margin between A and B falls below a threshold, in which case the system faithfully abstains rather than guessing.
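A minimal sketch of margin‑based aggregation with faithful abstention, consistent with the abstract's description; the vote format, margin normalization, and threshold value are assumptions for illustration, not taken from the paper:

```python
def aggregate_cci(votes, margin_threshold=0.2):
    """Aggregate per-pair votes into a final decision or abstention.
    `votes` is a list of (choice, alpha) pairs with choice in {"A", "B"}
    and alpha a self-reported confidence in [0, 1]."""
    weight = {"A": 0.0, "B": 0.0}
    for choice, alpha in votes:
        weight[choice] += alpha
    total = weight["A"] + weight["B"]
    if total == 0:
        return "abstain"                  # no usable evidence
    margin = abs(weight["A"] - weight["B"]) / total
    if margin < margin_threshold:
        return "abstain"                  # faithful abstention on low margin
    return "A" if weight["A"] > weight["B"] else "B"
```

With three references the system therefore answers only when the pairwise comparisons agree strongly enough, which is what separates this aggregation from a plain majority vote.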

