Bi-MCQ: Reformulating Vision-Language Alignment for Negation Understanding
Recent vision-language models (VLMs) achieve strong zero-shot performance via large-scale image-text pretraining and have been widely adopted in medical image analysis. However, existing VLMs remain notably weak at understanding negated clinical statements, largely due to contrastive alignment objectives that treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive image-prompt alignments, limiting effective learning of disease absence. To overcome these limitations, we reformulate vision-language alignment as a conditional semantic comparison problem, instantiated through a bi-directional multiple-choice learning framework (Bi-MCQ). By jointly training Image-to-Text and Text-to-Image MCQ tasks with affirmative, negative, and mixed prompts, our method implements fine-tuning as conditional semantic comparison instead of global similarity maximization. We further introduce direction-specific cross-attention fusion modules to address the asymmetric cues required by bi-directional reasoning and to reduce alignment interference. Experiments on ChestXray14, Open-I, CheXpert, and PadChest show that Bi-MCQ improves negation understanding by up to 0.47 AUC over the zero-shot performance of the state-of-the-art CARZero model, while achieving up to a 0.08 absolute gain on positive-negative combined (PNC) evaluation. Additionally, Bi-MCQ reduces the affirmative-negative AUC gap by an average of 0.12 compared to InfoNCE-based fine-tuning, demonstrating that objective reformulation can substantially enhance negation understanding in medical VLMs.
💡 Research Summary
The paper addresses a critical shortcoming of current medical vision‑language models (VLMs): their inability to correctly interpret negated clinical statements. Existing models rely on large‑scale image‑text pretraining followed by contrastive fine‑tuning using InfoNCE, which maximizes global similarity between image and text embeddings. Because negation merely flips meaning without substantially altering lexical content, the contrastive objective treats “no evidence of pneumonia” and “evidence of pneumonia” as nearly identical, causing the embeddings to collapse. This problem is amplified in multi‑label medical datasets where normal or disease‑absent cases dominate; generic negative prompts form easy positive alignments, providing weak learning signals for disease absence.
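The global-similarity objective described above can be made concrete with a minimal InfoNCE sketch (a standard CLIP-style symmetric contrastive loss, not the paper's exact code). Note how the loss only rewards matched image-text pairs against in-batch negatives: nothing penalizes a negated prompt for landing near its affirmative counterpart in embedding space.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """Standard symmetric InfoNCE over a batch of matched image-text pairs.

    image_emb, text_emb: (B, D) embeddings; matched pairs share a row index.
    The objective maximizes global similarity of each matched pair against
    all other in-batch pairs -- it never contrasts an affirmative prompt
    with its own negation, which is the weakness Bi-MCQ targets.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pairs on the diagonal
    # average the two retrieval directions (image->text and text->image)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```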
To overcome these issues, the authors reformulate vision‑language alignment as a conditional semantic comparison task and introduce a bi‑directional multiple‑choice question (Bi‑MCQ) fine‑tuning framework. Two complementary tasks are jointly optimized: Image‑to‑Text (I2T) MCQ, where each image is paired with affirmative, negative, and mixed textual candidates and the model must select the semantically correct prompt; and Text‑to‑Image (T2I) MCQ, where each textual query is paired with a set of images from the same batch and the model must pick the image that truly matches the query. By placing affirmative and negative statements in direct competition, the approach forces the model to resolve negation rather than treating it as a low‑information modifier.
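The two MCQ directions can be sketched as cross-entropy losses over candidate sets. This is an illustrative reconstruction under stated assumptions (cosine-similarity scoring, a temperature of 0.07, and K candidate prompts per image), not the authors' released implementation; in the actual framework, scoring would pass through the fusion modules described next.

```python
import torch
import torch.nn.functional as F

def i2t_mcq_loss(image_emb, candidate_text_embs, correct_idx, temperature=0.07):
    """I2T MCQ: each image scores K candidate prompts (affirmative,
    negative, mixed) and must select the semantically correct one.

    image_emb: (B, D), candidate_text_embs: (B, K, D), correct_idx: (B,)
    """
    img = F.normalize(image_emb, dim=-1).unsqueeze(1)   # (B, 1, D)
    txt = F.normalize(candidate_text_embs, dim=-1)      # (B, K, D)
    logits = (img * txt).sum(-1) / temperature          # (B, K) per-candidate scores
    return F.cross_entropy(logits, correct_idx)

def t2i_mcq_loss(text_emb, image_embs, correct_idx, temperature=0.07):
    """T2I MCQ: each textual query scores the images in the batch and
    must pick the one that truly matches. text_emb, image_embs: (B, D)."""
    txt = F.normalize(text_emb, dim=-1)
    img = F.normalize(image_embs, dim=-1)
    logits = txt @ img.t() / temperature                # (B, B) query-vs-batch scores
    return F.cross_entropy(logits, correct_idx)
```

Placing the negated prompt inside the same candidate set as its affirmative counterpart is what turns negation into a decision boundary rather than a low-information token.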
A key architectural contribution is the direction‑specific Cross‑Attention Fusion module. In I2T, the global image embedding serves as the query while both global and token‑level text embeddings act as keys and values, allowing the model to attend to textual cues (including negation) conditioned on visual evidence. Conversely, in T2I, the global text embedding queries the combination of global and spatial image features. Separate cross‑attention pathways prevent interference between the two directions and capture the asymmetric cues required for each reasoning mode.
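A minimal sketch of one such fusion pathway, assuming standard multi-head attention with the global embedding of one modality as the query and the other modality's global-plus-token features as keys and values (layer sizes, residual connection, and normalization are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One direction-specific fusion pathway: a single global query vector
    attends over the other modality's global feature plus its token-level
    (text tokens or image patches) features."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_global, kv_global, kv_tokens):
        # query_global: (B, D); kv_global: (B, D); kv_tokens: (B, N, D)
        q = query_global.unsqueeze(1)                           # (B, 1, D)
        kv = torch.cat([kv_global.unsqueeze(1), kv_tokens], 1)  # (B, 1+N, D)
        fused, _ = self.attn(q, kv, kv)                         # attend over K/V set
        return self.norm(query_global + fused.squeeze(1))       # residual + norm

# Separate, non-shared pathways for the two reasoning directions:
i2t_fusion = CrossAttentionFusion(dim=512)  # image query attends to text features
t2i_fusion = CrossAttentionFusion(dim=512)  # text query attends to image features
```

Keeping the two pathways parameter-disjoint matches the paper's motivation: I2T must localize textual cues (including negation words) given visual evidence, while T2I must localize spatial evidence given a textual hypothesis, and a shared module would have to compromise between the two.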
Experiments were conducted on four chest‑X‑ray datasets: ChestXray14 (used for fine‑tuning), Open‑I, CheXpert, and PadChest (used for cross‑dataset evaluation). The primary metrics are AUC for negation understanding and Positive‑Negative Combined (PNC) score. Bi‑MCQ outperforms the state‑of‑the‑art zero‑shot model CARZero by up to 0.47 AUC and improves PNC by up to 0.08 absolute points. Compared with traditional InfoNCE‑based fine‑tuning, Bi‑MCQ reduces the affirmative‑negative AUC gap by an average of 0.12, demonstrating a more balanced discrimination of disease presence versus absence. Ablation studies confirm that both the bi‑directional MCQ formulation and the cross‑attention modules contribute substantially to these gains.
In summary, the paper’s contributions are threefold: (1) redefining the fine‑tuning objective from global similarity maximization to conditional semantic comparison, (2) introducing a bi‑directional MCQ paradigm that explicitly models negation as a core semantic factor, and (3) designing asymmetric cross‑attention fusion to support distinct image‑to‑text and text‑to‑image reasoning. The approach is not limited to medical imaging; it offers a general strategy for improving negation and conditional reasoning in any vision‑language system. Future work may extend the method to more complex logical constructs, additional imaging modalities, and real‑time clinical deployment.