Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models
Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.
💡 Research Summary
The paper addresses a subtle yet critical failure mode of vision‑language models (VLMs): the tendency to accept culturally plausible but visually incorrect statements, a phenomenon the authors term “counterfactual hallucination.” Existing hallucination benchmarks focus almost exclusively on Western images and English text, leaving a gap in evaluating models on non‑Western visual contexts and multilingual inputs. To fill this gap, the authors introduce M2CQA (Multimodal MENA Contrastive Question‑Answering), a benchmark built from images collected across 17 Middle East and North Africa (MENA) countries. For each image, they create one true statement (Q⁺) and two counterfactual statements (Q⁻) that are culturally plausible yet unsupported by the visual evidence. These statements are provided in four language variants: English, Modern Standard Arabic (MSA), Egyptian Arabic, and Levantine Arabic, allowing a systematic study of multilingual effects.
A central contribution is the CounterFactual Hallucination Rate (CFHR), a conditional metric defined as the proportion of cases where a model, after correctly answering the true statement, still accepts at least one false counterfactual. Formally, CFHR = (Acc(Q⁺) – Acc(combined)) / Acc(Q⁺), where Acc(combined) measures joint correctness on Q⁺ and all Q⁻. This metric isolates hallucination that occurs despite successful visual grounding, unlike prior metrics (POPE, CHAIR) that treat each sample independently.
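Given the definition above, CFHR reduces to the fraction of Q⁺-correct samples on which the model accepts at least one counterfactual. A minimal Python sketch of this computation follows; the field names `q_plus` and `q_minus` are illustrative stand-ins for per-sample correctness flags, not identifiers from the paper:

```python
def cfhr(results):
    """Compute the CounterFactual Hallucination Rate.

    `results` is a list of per-sample dicts:
      - "q_plus":  True if the model judged the true statement correctly
      - "q_minus": list of booleans, True if the model correctly
                   rejected that counterfactual statement

    CFHR = (Acc(Q+) - Acc(combined)) / Acc(Q+), which equals the share
    of Q+-correct samples that fail at least one Q- judgment.
    """
    correct_plus = [r for r in results if r["q_plus"]]
    if not correct_plus:
        return None  # CFHR undefined when no true statement is answered correctly
    hallucinated = sum(1 for r in correct_plus if not all(r["q_minus"]))
    return hallucinated / len(correct_plus)
```

Conditioning on Q⁺ correctness is what distinguishes this from sample-independent metrics: a model is only penalized for hallucinations it commits *after* demonstrating correct visual grounding.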
The dataset construction pipeline is rigorous: images are sourced via geo‑targeted Google searches, multiple‑choice questions are generated with GPT‑4.1, and any MCQ answerable without the image (as judged by text‑only models) is filtered out. Counterfactual statements are derived from the incorrect MCQ options, translated into MSA and dialects using a combination of in‑house MT and GPT‑4.1, and a subset is manually verified for visual support and image necessity. The final benchmark contains 9,990 samples, each with one image, one Q⁺, and two Q⁻ statements.
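The text-only filtering step described above can be sketched in a few lines. Here `predict_without_image` is a hypothetical stand-in for a text-only model call (the paper uses text-only judge models; this is not the authors' implementation):

```python
def filter_mcqs(mcqs, predict_without_image):
    """Discard MCQs that a text-only model can already answer,
    keeping only questions where the image is genuinely required.

    Each MCQ is a dict with "question", "options", and "answer" keys;
    `predict_without_image(question, options)` returns the model's pick.
    """
    kept = []
    for mcq in mcqs:
        prediction = predict_without_image(mcq["question"], mcq["options"])
        if prediction != mcq["answer"]:
            kept.append(mcq)  # text-only model failed, so the image matters
    return kept
```

The design intent is that any question surviving this filter cannot be solved from cultural or linguistic priors alone, which is exactly the property the counterfactual statements later exploit.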
Experiments evaluate several state‑of‑the‑art VLMs (Qwen‑3‑VL at 2B–32B, Gemma‑3‑VL at 4B–27B, and the Arabic‑focused models Fanar‑Oryx and AIN) across three prompting strategies: (1) direct True/False; (2) Answer‑then‑Reason, where the model answers first and then provides a justification; and (3) Reasoning‑First, where the model reasons before committing to an answer. The results answer five research questions:
- Standard accuracy metrics mask counterfactual hallucination. Models often achieve high Q⁺ accuracy and respectable F1 scores while still exhibiting large CFHR values, indicating they accept plausible false statements after a correct answer.
- Language and dialect matter. CFHR rises from English to MSA and further in Egyptian and Levantine dialects, even when Q⁺ accuracy remains stable, suggesting linguistic uncertainty amplifies reliance on cultural priors.
- Model families differ. Qwen‑3‑VL consistently shows lower CFHR than Gemma‑3‑VL. Among Arabic‑focused models, Fanar‑Oryx tends to reject both true and false statements (lower CFHR but also lower Q⁺ accuracy), whereas AIN maintains high Q⁺ accuracy but suffers from very high CFHR, reflecting a propensity to accept culturally tempting alternatives.
- Prompting strategy influences hallucination. “Answer‑then‑Reason” generally reduces CFHR, likely because committing to an answer first forces the model to justify it, whereas “Reasoning‑First” often increases CFHR, especially in Arabic and its dialects, as the model may generate culturally biased reasoning before grounding its conclusion in the visual evidence.
- Scaling helps but not uniformly. Larger model sizes tend to lower CFHR, particularly for Qwen‑3‑VL, while Gemma‑3‑VL shows diminishing returns after a certain scale.
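As an illustration, the three prompting strategies compared above could be phrased as simple templates like the following. These are paraphrased sketches, not the paper's exact prompts:

```python
# Illustrative True/False prompt templates for the three strategies;
# {statement} is filled with the Q+ or Q- statement under evaluation.
PROMPTS = {
    "direct": (
        "Look at the image. Is the following statement true or false?\n"
        "Statement: {statement}\n"
        "Answer with 'True' or 'False' only."
    ),
    "answer_then_reason": (
        "Look at the image. Is the following statement true or false?\n"
        "Statement: {statement}\n"
        "First answer 'True' or 'False', then explain your reasoning."
    ),
    "reasoning_first": (
        "Look at the image and the statement below.\n"
        "Statement: {statement}\n"
        "Reason step by step about the visual evidence, "
        "then conclude with 'True' or 'False'."
    ),
}

def build_prompt(strategy, statement):
    """Fill the chosen template with a concrete statement."""
    return PROMPTS[strategy].format(statement=statement)
```

The reported gap between Answer‑then‑Reason and Reasoning‑First suggests that even small ordering changes in such templates can shift how strongly the model leans on cultural priors versus the image.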
The authors conclude that evaluating VLMs solely on raw accuracy is insufficient for culturally diverse deployments. CFHR provides a more diagnostic lens, revealing conditional grounding failures that could lead to misinformation or biased behavior in real‑world applications. They advocate for richer multilingual datasets, careful prompt engineering, and scaling strategies that prioritize cultural robustness. Future work is suggested on integrating CFHR into training objectives, expanding dialect coverage, and incorporating human‑in‑the‑loop verification to continuously improve dataset quality and model reliability.