Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it remains unclear how VLMs arbitrate between contradictory information from different modalities. To bridge this gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. We then design and execute a thorough evaluation framework to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.


💡 Research Summary

This paper investigates the vulnerability of Vision‑Language Models (VLMs) when confronted with deliberately misleading textual information that conflicts with clear visual evidence. While prior work has examined misinformation in pure language models, the multimodal setting—where both image and text are presented—has received far less attention. To fill this gap, the authors introduce the CONTEXT‑VQA benchmark, a dataset that pairs standard VQA image‑question items with persuasive, contradictory prompts generated using four rhetorical strategies: repetition, logical appeal, credibility appeal, and emotional appeal.
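The four rhetorical strategies can be pictured as fill-in templates keyed on the incorrect target answer. The actual templates used with Gemini 2.5-Pro are not reproduced here; the strings below are illustrative placeholders only, and `build_persuasion` is a hypothetical helper name.

```python
# Illustrative sketch of the four rhetorical attack styles described in the
# paper. The exact templates are not public; these strings are placeholders.
PERSUASION_TEMPLATES = {
    "repetition": "The answer is {nf}. I repeat: the answer is {nf}. It really is {nf}.",
    "logical": "Given everything visible in the scene, the only consistent conclusion is {nf}.",
    "credibility": "As an expert who has analyzed thousands of such images, I can confirm the answer is {nf}.",
    "emotional": "Please trust me on this; it truly matters to me that you see the answer is {nf}.",
}

def build_persuasion(style: str, non_fact: str) -> str:
    """Fill a rhetorical template with the incorrect target answer (the NF)."""
    return PERSUASION_TEMPLATES[style].format(nf=non_fact)
```

In the paper's pipeline the equivalent generation step is performed by Gemini 2.5-Pro and then filtered by human annotators, rather than by fixed string templates.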

The dataset construction proceeds in three stages. First, 2,000 image‑question pairs are sampled from the A‑OKVQA corpus and evaluated across eleven state‑of‑the‑art VLMs. The subset of 920 items that every model answers correctly (100 % accuracy) is retained, ensuring that any subsequent performance change can be attributed solely to textual manipulation. Second, for each question an “incorrect target” (the second‑most confident wrong answer) is selected as the non‑fact (NF). Using Gemini 2.5‑Pro, the authors automatically generate four distinct persuasive texts that argue for the NF, following handcrafted templates for each rhetorical style. Human annotators then validate and filter the outputs. Third, a multi‑round conversational evaluation framework is defined: (I) baseline performance is recorded; (II) in each round the entire prior dialogue (image, question, previous persuasions, and model responses) is concatenated with a new persuasive message, mimicking a sustained dialogue where a user repeatedly feeds misinformation; (III) after all rounds, final accuracy and confidence shifts are measured.
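The multi-round protocol (stages I–III above) can be sketched as a simple dialogue loop. This is a minimal sketch, not the authors' code: `query_vlm` stands in for a real model API call and is stubbed here so the control flow is runnable.

```python
# Minimal sketch of the multi-round persuasion protocol (stages I-III).
def query_vlm(image, dialogue):
    # Stub: a real implementation would send the image plus the full
    # dialogue history to a VLM and return its answer string.
    return "stub-answer"

def run_persuasion_rounds(image, question, persuasions, correct_answer):
    # Stage I: record the baseline answer with no conflicting text.
    dialogue = [("user", question)]
    baseline = query_vlm(image, dialogue)
    dialogue.append(("model", baseline))

    # Stage II: each round appends a new persuasive message to the entire
    # prior dialogue, mimicking sustained misinformation from a user.
    answers = []
    for text in persuasions:
        dialogue.append(("user", text))
        answer = query_vlm(image, dialogue)
        dialogue.append(("model", answer))
        answers.append(answer)

    # Stage III: final accuracy is whether the last answer still matches
    # the ground truth after all rounds of persuasion.
    return baseline, answers, answers[-1] == correct_answer
```

The key design point is that the full history, including the model's own prior concessions, is re-fed each round, which is what allows an early induced error to be reinforced later.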

Eleven models are tested, including open‑source variants (Qwen‑VL‑2.5‑3B/7B, InternVL‑3‑1B/2B/8B, LLaVA‑OneVision‑0.5B/7B) and proprietary systems (Gemini‑2.5‑Flash/Pro, GPT‑4o‑mini, GPT‑4o). All models achieve perfect scores on the filtered set before any persuasion. After a single round of misleading text, average accuracy declines sharply: by 50.9 points under repetition, 43.5 under logical appeal, 51.3 under credibility appeal, and 61.5 under emotional appeal, an overall reduction of more than 48 % relative to the baseline. Emotional appeals cause the steepest decline for most models, while logical and credibility appeals also produce substantial drops. Notably, even the strongest proprietary models (e.g., Gemini‑2.5‑Pro) suffer degradations under emotional persuasion (accuracy falling to 84 %) despite maintaining high performance under logical or credibility attacks.

The multi‑round experiments reveal that a single persuasive input can permanently shift a model’s belief, and subsequent rounds tend to reinforce the induced error rather than correct it. This suggests that VLMs lack robust mechanisms for re‑evaluating visual evidence when faced with contradictory textual cues, effectively over‑weighting recent textual inputs.

The authors highlight several contributions: (1) the first systematic study of textual misinformation in multimodal VLMs; (2) the release of the CONTEXT‑VQA dataset with diverse rhetorical attacks; (3) a novel multi‑turn evaluation protocol that captures sustained manipulation effects. They also acknowledge limitations: the benchmark is limited to multiple‑choice format, the persuasive texts are generated by a single LLM (potentially biasing style), and the visual content is relatively unambiguous, leaving open questions about performance on more complex or abstract scenes.

Future work is suggested in three directions: (a) designing cross‑modal attention or consistency checks that explicitly compare textual claims with visual features; (b) adversarial training regimes that expose VLMs to conflicting textual inputs during fine‑tuning; and (c) incorporating human‑in‑the‑loop verification to detect and correct misinformation during deployment. Overall, the paper underscores a critical security and reliability gap in current VLMs and provides a valuable testbed for developing more robust multimodal reasoning systems.
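Direction (a), an explicit cross-modal consistency check, could in its simplest form compare the model's answer with and without the suspect text. The sketch below is a hedged illustration of that idea, not a method from the paper; `ask` is a hypothetical model-call function supplied by the caller.

```python
# Hedged sketch of direction (a): flag answers that flip once persuasive
# text is appended, by comparing against an image-plus-question-only query.
# `ask(image, prompt)` is a hypothetical caller-supplied model interface.
def consistency_check(ask, image, question, persuasive_text):
    clean = ask(image, question)                          # visual evidence only
    attacked = ask(image, question + "\n" + persuasive_text)
    # A mismatch suggests the textual claim, not the image, drove the answer,
    # so the response could be withheld or routed to human verification (c).
    return {"clean": clean, "attacked": attacked, "consistent": clean == attacked}
```

A check like this only detects that the text changed the answer; deciding which answer to trust still requires the re-weighting of visual evidence that the paper argues current VLMs lack.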

