CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, Wenjian Luo
Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
huyamin@hit.edu.cn

[Figure: example binary QA and multiple-choice QA items with model answers. Panel titles: Driven by Counting-CDH, Driven by Relation-CDH, Driven by Attribute-CDH. Subcategory labels: Animal Parts, Body Parts, Everyday Parts, Plant Structure, Physical State, Temperature, Color, Luminescence, Animal Behavior, Object Function, Size Scale, Spatial.]

Abstract

Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence–commonsense conflicts.
CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary question answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence–commonsense conflict. CDH-Bench thus provides a controlled diagnostic of visual fidelity in this setting. Data are available on GitHub and Hugging Face.

1 Introduction

"I see what I believe, not what I see." This classic human bias increasingly appears in modern VLMs. These models excel at standard image captioning and visual question answering (VQA) [2; 6; 10], but reliability hinges on a harder scenario: what happens when visual evidence conflicts with commonsense?

Consider an example in medical imaging: an X-ray reveals an anatomical anomaly, such as a hand with six fingers. A competent radiologist should describe what the image actually shows (evidence) rather than what is normally expected (commonsense). However, when the same image is presented to a VLM with the question "How many fingers does the hand have?", the model may answer "five." This failure does not reflect difficulty in perception, but rather that commonsense overrides visual evidence.

This is not the usual "object hallucination" studied in prior benchmarks [9; 13; 5], where models fabricate objects absent from the image. Instead, we study a different hallucination mode: the relevant visual evidence is present (and often salient), but the model still defaults to what is typically true. In other words, when evidence and commonsense conflict, the output snaps back to the commonsense alternative rather than reflecting the image.
We call this commonsense-driven hallucination (CDH): a prior-driven normalization in which models report the commonsense alternative instead of the visual evidence.

A useful perspective is that VLMs implicitly arbitrate between competing signals: a visual perception channel grounded in the image and a commonsense-prior channel that favors high-probability completions learned from a commonsense-dominated distribution. When these channels conflict, generation can behave like semantic smoothing: genuine but commonsense-violating evidence is treated as noise and smoothed toward a commonsense semantic prior (e.g., six fingers to five; blue banana to yellow). This perspective predicts the experimental result we observe: CDH persists even in strong frontier models and becomes especially visible when the commonsense alternative is explicitly presented.

CDH matters most where anomalies matter: medical imaging, quality inspection, scientific discovery, and forensics. A system that "normalizes" unusual evidence can be dangerously confident while wrong. Yet current evaluation paradigms are poorly suited to detect CDH. Most benchmarks use commonsense-consistent imagery, so visual evidence and commonsense priors typically agree [9; 13; 14; 5]. Hallucination benchmarks often suffer from ambiguous error attribution: when a model answers incorrectly, it is unclear whether it guessed from priors or truly misperceived the image. We therefore need an evaluation that creates conflict on purpose and supports unambiguous error attribution.

We introduce CDH-Bench, a benchmark designed to systematically evaluate visual fidelity under visual evidence–commonsense conflict. Our contributions are:

(1) Formalization of CDH. We define commonsense-driven hallucination (CDH) as a distinct reliability failure pattern in which models override visual evidence with learned priors, distinguishing it from standard object fabrication.

(2) The CDH-Bench.
We construct a diagnostic benchmark consisting of 300 counterfactual–commonsense image pairs, designed specifically for controlled comparison and unambiguous error attribution.

(3) Systematic evaluation and scaling analysis. We evaluate frontier VLMs under binary QA and multiple-choice QA settings, report metrics including CF-Acc, CS-Acc, CFAD, CCR, and RPD, and conduct a controlled analysis of Qwen3-VL Instruct and Thinking variants [16; 3] to examine how scale and reasoning style affect CDH susceptibility.

2 Related Work

General Ability Evaluation for Vision-Language Models. Standard VLM benchmarks evaluate broad multimodal capabilities. VQA v2 [2] and GQA [6] test visual question answering, which requires visual evidence from the input image. OK-VQA [11] requires external knowledge integration, while comprehensive benchmarks such as SEED-Bench [8] and MMBench [10] assess multiple dimensions of visual understanding. These benchmarks have driven substantial progress, with recent models approaching human performance on many tasks. Several benchmarks probe specific perceptual abilities: TallyQA [1] evaluates counting, CLEVR [7] tests compositional reasoning and spatial understanding in synthetic scenes, and VAW [12] assesses attribute recognition across a large inventory of visual concepts. These datasets are valuable for measuring fine-grained perception, but most of them operate in regimes where visual evidence and commonsense priors are naturally aligned, so priors support the task rather than compete with it.

Hallucination Evaluation for VLMs. Hallucination benchmarks measure how models generate content unsupported by visual input. POPE [9] uses binary existence questions to evaluate object hallucination, while CHAIR [13] measures hallucinated objects in image captions.
MMHal-Bench [14] and HallusionBench [5] provide broader evaluations across spatial relations and attributes. However, most prior work primarily targets fabrication hallucination. Our focus is different: we study normalization hallucination, where present visual evidence is overridden by the commonsense alternative.

Evaluation with Counterfactual and Compositional Reasoning. Counterfactual VQA [4] synthesizes counterfactual questions to reduce language bias during training, typically by altering the question while keeping the image fixed so that models cannot rely on superficial linguistic correlations alone. Winoground [15] probes visio-linguistic compositionality with a carefully hand-curated setup of two images and two captions, where the captions contain the same words but in different orders, requiring models to correctly align each image with its caption. These benchmarks are related in spirit, but they differ from our setting. CDH-Bench constructs counterfactual visual content itself, making the visual evidence–commonsense conflict a direct property of the image rather than of the linguistic query alone.

3 Benchmark Design

3.1 Design Motivations

CDH-Bench is motivated by critical gaps in existing evaluation paradigms that underestimate a fundamental failure pattern in VLMs.

The evaluation blind spot under commonsense-consistent distributions. Standard benchmarks such as VQA v2 [2], GQA [6], SEED-Bench [8], and MMBench [10] are largely built on natural-image distributions in which visual evidence and commonsense priors usually agree. As a result, benchmark success does not clearly distinguish genuine visual grounding from reliance on statistical regularities that happen to match the test distribution. Fine-grained perception benchmarks [7; 1] test many of the same underlying skills, but typically under settings where commonsense priors are helpful rather than adversarial.
This creates an evaluation blind spot: standard benchmarks effectively ask, "Can the model recognize what is there under commonsense-consistent conditions?" In contrast, CDH-Bench asks a more diagnostic question: "Can the model still report what is there when visual evidence conflicts with commonsense?"

Fabrication vs. normalization hallucination. Existing hallucination benchmarks [9; 13] mainly target fabrication hallucination, where the model mentions entities, attributes, or relations that are absent from the image. CDH-Bench targets a different failure pattern: the model ignores present visual evidence and outputs the commonsense alternative as if the anomaly were noise. A six-fingered hand, for example, is not visually absent; it is visually present but conflicts with commonsense. CDH-Bench is designed to expose this normalization-style failure directly.

The error attribution problem. Many hallucination benchmarks also suffer from ambiguous error attribution. When a model answers incorrectly, the failure may come from weak perception, prior-based guessing, or confusion induced by the task format. Without a matched control condition, these sources are difficult to disentangle. By pairing each counterfactual image with a commonsense counterpart that preserves the overall scene while changing only the target anomaly, CDH-Bench enables cleaner attribution of prior-driven failure under visual evidence–commonsense conflict.

3.2 Design Principles

Building on these motivations, CDH-Bench is designed around four principles:

Paired data design for unambiguous attribution. Each counterfactual image is paired with a commonsense counterpart that differs only in the target anomaly. This controlled comparison enables stronger causal interpretation: if a model succeeds on the commonsense image but fails on the counterfactual image, the error is best explained as a prior-driven override rather than a generic incapacity.
Multi-dimensional coverage. We cover three fundamental dimensions of commonsense: quantity, relation, and attribute. This enables us to analyze how prior-driven normalization behaves across different semantic aspects.

Task-controlled evaluation. We employ two task formats to probe CDH under different levels of cognitive competition. In binary QA, the question text explicitly contrasts the counterfactual visual evidence with its commonsense alternative, directly measuring whether the model can prioritize anomalous evidence within a minimal output space. Multiple-choice QA introduces a higher level of difficulty by explicitly presenting the commonsense alternative as a competitive answer choice, which is particularly effective for diagnosing systematic collapse to learned priors.

Fine-grained error taxonomy. We classify errors into commonsense errors and other errors. This distinction, quantified through CCR, provides a sharper diagnostic signal than accuracy alone, particularly in multiple-choice QA, where direct answer competition makes the interpretation most transparent.

3.3 Counterfactual Dimensions

We construct 600 images, organized as 300 counterfactual–commonsense pairs. Figure 1 shows the taxonomy and distribution of CDH-Bench.

Figure 1: Taxonomy and distribution of CDH-Bench.

Counting anomalies (100 pairs). This dimension focuses on unusual but visually unambiguous quantities that conflict with biological or structural commonsense, spanning cases such as six-fingered hands, animals with extra legs, and everyday objects with atypical component counts.

Relational anomalies (100 pairs). This dimension focuses on inverted relations between entities that violate typical behavioral or physical scripts, spanning behavioral reversals (e.g., mice hunting cats), spatial inversions (e.g., upside-down houses), size reversals, and causal reversals.

Attribute anomalies (100 pairs).
This dimension focuses on objects possessing properties that violate fundamental physical expectations, spanning color anomalies (e.g., blue bananas), material anomalies (e.g., transparent wood), temperature conflicts, and atypical physical states.

3.4 Image Generation and Quality Control

We generate images using strong text-to-image models with prompts engineered to counteract the generator's commonsense bias. In particular, prompts are designed to make the intended anomaly necessary rather than optional by combining: (1) explicit structural constraints (e.g., exact counts, left-to-right enumeration, symmetry/asymmetry requirements, and positional anchors such as "center" or "above"), (2) decomposed anatomical/relational descriptions (e.g., specifying per-part attributes like nails/knuckles/iris detail and clarifying spatial relations among parts), and (3) physical support cues (e.g., viewpoint, lens/shot type, lighting, depth-of-field, and photorealistic texture) to stabilize realism and reduce unintended artifacts.

For each anomaly type, we construct a matched commonsense counterpart by keeping composition, viewpoint, subject identity, background, and photographic style as consistent as possible, while modifying only the minimal attribute required by the counterfactual (typically a single count-based change). We also add mild redundancy in the text (e.g., stating "exactly k" and explicitly listing parts) to improve controllability and make the anomaly salient under realistic rendering.

Figure 2: CDH-Bench showcase: examples from diverse dimensions. Each subfigure displays a counterfactual (CF) image on the left and its commonsense (CS) counterpart on the right, together with the associated binary QA and multi-choice QA evaluation details.

  Category: Counting - Body Parts
  Binary QA: Does the mouth have 2 rows of teeth and not 3 rows?
  Multi-choice QA: How many rows of teeth are visible? Options: A. 3 rows, B. 4 rows, C. 2 rows, D. 1 row

  Category: Counting - Animal Parts
  Binary QA: Does the chicken have 2 wings and not 6 wings?
  Multi-choice QA: How many wings does this chicken have? Options: A. 4 wings, B. 6 wings, C. 2 wings, D. 8 wings

  Category: Relational - Size Scale
  Binary QA: Is the whale swimming in the ocean and not in a fish tank?
  Multi-choice QA: What is unusual about the whale's environment? Options: A. ..., B. The whale is contained in a small fish tank, C. The whale is swimming in the ocean normally, D. ...

  Category: Attribute - Material
  Binary QA: Is the foam light and airy and not dense and heavy?
  Multi-choice QA: What is the density of this foam? Options: A. Dense and heavy, B. Medium, C. Solid, D. Light and airy

Quality control includes: (i) human verification to confirm that the anomaly is unambiguous and visually central, and that the paired commonsense image is normal with no additional unexpected anomalies; (ii) pair-level matching to ensure the counterfactual and commonsense images are closely aligned in pose, illumination, and overall style.

3.5 Task Formats and Evaluation

We evaluate each image pair under two task formats with format-specific scoring protocols. Although the data is organized into counterfactual–commonsense pairs, each image is processed independently during inference: the model receives only a single image as input (either a counterfactual image or a commonsense image) without any access to its paired counterpart.

Binary QA. This format uses compositional yes/no questions that explicitly contrast the counterfactual target with its commonsense alternative (e.g., "Does the mouth have 2 rows of teeth and not 3 rows?"). Crucially, this design enables a unified query for both counterfactual (CF) and commonsense (CS) images within each pair.
By maintaining an identical question text across both conditions, we eliminate potential linguistic biases or prompt-induced variances that could otherwise confound the comparison. This ensures that performance gaps and relative metrics are attributable to the visual–commonsense conflict in the imagery rather than to differences in the textual phrasing, providing a more rigorous measure of visual fidelity.

Multiple-Choice QA. Each question contains four answer options: one correct answer for the current counterfactual image, one commonsense alternative, and two additional plausible distractors. The option order is randomly shuffled for each instance to avoid positional bias. This format is especially diagnostic because it makes the commonsense alternative explicit and places it in direct competition with the visually correct answer.

For both formats, we evaluate performance separately on CF and CS images and then derive metrics that characterize robustness under visual evidence–commonsense conflict, as well as error attribution metrics across paired instances. Each of the 300 image pairs is evaluated under two image conditions (counterfactual and commonsense) and two task formats, yielding 300 × 2 × 2 = 1,200 evaluated instances in total.

4 Evaluation Metrics

We design metrics around three concrete questions that standard accuracy alone cannot disentangle. For each question, we introduce the metric(s) that directly answer it.

Q1: How well does a model perform under visual evidence–commonsense conflict?

Counterfactual Accuracy (CF-Acc). CF-Acc is the accuracy on counterfactual (CF) images, and is our primary measure of visual fidelity under conflict. It directly answers Q1 by measuring whether the model follows visual evidence when that evidence conflicts with commonsense.

Q2: How large is the collapse relative to normal-condition capability?

Commonsense Accuracy (CS-Acc).
CS-Acc is the accuracy on matched commonsense (CS) counterparts. It measures the model's general multimodal capabilities, specifically its ability to perform basic visual observation and standard commonsense reasoning in settings where visual evidence and priors are naturally aligned. This serves as a necessary baseline and capability control under non-conflicting conditions.

Counterfactual Accuracy Drop (CFAD). We define the accuracy drop from the commonsense condition to the counterfactual condition as

\[ \mathrm{CFAD} = \text{CS-Acc} - \text{CF-Acc}. \]

CFAD answers Q2 by quantifying the absolute degradation under visual evidence–commonsense conflict (larger positive CFAD indicates stronger degradation).

Relative Prior Dependency (RPD). To compare collapse strength across models with different baselines, we normalize CFAD by CS-Acc:

\[ \mathrm{RPD} = \frac{\mathrm{CFAD}}{\text{CS-Acc}}. \]

RPD also answers Q2, but in relative terms: of what the model can do when visual evidence and commonsense agree, how much is lost when they conflict?

Q3: When the model fails on counterfactual images, does it fail specifically by reverting to the commonsense alternative?

Error classification. When a model produces an incorrect answer on a counterfactual image, we classify the error as (1) a commonsense error if the response matches the commonsense alternative, and (2) an other error otherwise.

Commonsense Collapse Rate (CCR). CCR measures the fraction of CF errors that collapse specifically to the commonsense alternative:

\[ \mathrm{CCR} = \frac{\#\,\text{commonsense-aligned errors on CF images}}{\#\,\text{total errors on CF images}}. \]

CCR answers Q3 by measuring how often counterfactual failures are specifically commonsense-aligned rather than arbitrary. In binary QA, this corresponds to predicting the response consistent with the CS counterpart on a CF image. In multiple-choice QA, this corresponds to selecting the explicit commonsense alternative among the answer options.
Although CCR is definable in both formats, we report it in the main tables only for multiple-choice QA, where direct answer competition makes the interpretation more transparent.

5 Experiments

5.1 Models Evaluated

We evaluate eight frontier multimodal systems in the main comparison: gemini-3.1-pro-preview, gemini-3.1-flash-lite-preview, doubao-seed-1.8-251228, qwen3.5-plus, kimi-k2.5, gemini-2.5-flash-all, gpt-5.4-mini, and gpt-5.4.

For controlled family analysis, we further evaluate Qwen3-VL variants and divide them into two groups: Instruct and Thinking [16; 3]. This setup allows us to compare not only performance across scales, but also the effect of reasoning style within a shared model family.

Unless stated otherwise, all models are evaluated under a unified protocol: identical prompts per task format, fixed decoding settings where applicable, and deterministic answer extraction for multiple-choice QA.

5.2 Evaluation Granularity and Aggregation

We report three levels of analysis.

Task-level aggregation. For the main comparison, we report two task-specific views: QA (binary QA) and MC (multiple-choice QA). For each task, we provide both overall results and category-level breakdowns over Attribute Anomalies, Counting Anomalies, and Relational Anomalies.

Subcategory-level diagnosis. Beyond the primary categories, we further analyze fine-grained subcategories to identify which anomaly types remain consistently difficult across models. This analysis is especially useful because category-level averages can hide concentrated bottlenecks such as body-part counting.

Family-level scaling and reasoning analysis. For the Qwen3-VL study, we merge QA and MC into a unified comparison table and report metrics grouped by task and overall aggregation. This presentation makes it easier to compare scale effects and reasoning-style effects within a shared model family.
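As an illustration only, the Section 4 metrics and the overall/category-level aggregation described above could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the `Record` schema, its field names, and the helper functions `cdh_metrics` and `metrics_by_category` are hypothetical and do not reflect the released data format. Values are returned as fractions in [0, 1] rather than percentages.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """One evaluated instance (hypothetical schema, assumed for illustration)."""
    category: str                    # e.g. "Counting", "Relational", "Attribute"
    condition: str                   # "CF" (counterfactual) or "CS" (commonsense)
    correct: bool                    # whether the model's answer was right
    commonsense_error: bool = False  # CF only: wrong answer matched the commonsense alternative


def cdh_metrics(records: Iterable[Record]) -> Dict[str, Optional[float]]:
    """Compute CF-Acc, CS-Acc, CFAD, RPD and CCR for one group of records."""
    totals = {"CF": 0, "CS": 0}
    hits = {"CF": 0, "CS": 0}
    cf_errors = 0
    commonsense_aligned = 0
    for r in records:
        totals[r.condition] += 1
        hits[r.condition] += int(r.correct)
        if r.condition == "CF" and not r.correct:
            cf_errors += 1
            commonsense_aligned += int(r.commonsense_error)
    if totals["CF"] == 0 or totals["CS"] == 0:
        raise ValueError("need at least one CF and one CS record")

    cf_acc = hits["CF"] / totals["CF"]
    cs_acc = hits["CS"] / totals["CS"]
    cfad = cs_acc - cf_acc  # CFAD = CS-Acc - CF-Acc
    return {
        "CF-Acc": cf_acc,
        "CS-Acc": cs_acc,
        "CFAD": cfad,
        # RPD = CFAD / CS-Acc; undefined when CS-Acc is zero
        "RPD": cfad / cs_acc if cs_acc > 0 else None,
        # CCR = commonsense-aligned CF errors / all CF errors; undefined without CF errors
        "CCR": commonsense_aligned / cf_errors if cf_errors > 0 else None,
    }


def metrics_by_category(records: Iterable[Record]) -> Dict[str, Dict[str, Optional[float]]]:
    """Group records by category and add an 'Overall' group over all records."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for r in records:
        groups[r.category].append(r)
        groups["Overall"].append(r)
    return {name: cdh_metrics(rs) for name, rs in groups.items()}
```

For instance, plugging in CF-Acc = 0.7567 and CS-Acc = 0.9333 (the overall binary QA values reported for kimi-k2.5 in Table 1) gives a CFAD of roughly 0.177 and an RPD of roughly 0.189. Returning `None` for an undefined RPD or CCR is a design choice of this sketch; the paper does not specify how such cases are handled.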
5.3 Main Results

Table 1 presents the category-level results on CDH-Bench, while Figures 3–6 visualize model performance in terms of counterfactual accuracy and the accuracy drop from commonsense to counterfactual conditions.

Figure 3: Model performance comparison on counterfactual accuracy (CF-Acc) across different subcategories in the binary QA task.

Figure 4: Counterfactual Accuracy Drop (CFAD) across subcategories in the binary QA task.

Figure 5: Model performance comparison on counterfactual accuracy (CF-Acc) across different subcategories in the multiple-choice QA task.

Figure 6: Counterfactual Accuracy Drop (CFAD) across subcategories in the multiple-choice QA task.

CDH induces a consistent and substantial robustness gap across models and task formats. For nearly all evaluated models, accuracy on counterfactual images is lower than on matched commonsense controls, and this pattern holds in both binary QA and multiple-choice QA. Concretely, 7 out of 8 models show lower overall CF-Acc than CS-Acc in both settings; the only partial exception is gemini-3.1-pro-preview, which exhibits a slightly negative CFAD in QA but still shows a 7.33% drop in MC. Averaged across models, the overall CFAD is 16.39% in QA and increases to 25.20% in MC, while mean CF-Acc decreases from 75.24% to 66.90%. These results suggest that CDH is not an isolated failure mode of a few weak systems, but a systematic robustness challenge that persists across model families.

Multiple-choice QA further amplifies prior-driven failures. Compared with binary QA, MC generally yields both lower CF-Acc and larger CFAD, indicating that when counterfactual evidence must directly compete with a plausible commonsense alternative, models are more likely to revert to prior-consistent answers. On average, moving from QA to MC lowers CF-Acc by 8.34 points and increases CFAD by 8.81 points.
This trend is especially pronounced for models such as kimi-k2.5 and gpt-5.4-mini, which see their CFAD values rise from 17.67% to 29.33% and from 29.33% to 41.00%, respectively. Taken together, the QA–MC contrast suggests that CDH is not only a robustness issue under uncertainty, but also a comparative reasoning failure that emerges when image-grounded counterfactual evidence must override a strong commonsense prior.

Model-level comparisons further suggest that smaller degradation does not necessarily imply stronger counterfactual understanding, and that CDH-Bench exposes a structured failure landscape beyond aggregate accuracy. For instance, gpt-5.4 is weakest on Counting / Body Parts in both QA and MC, whereas doubao-seed-1.8-251228 remains particularly fragile on Temperature. At the same time, relative robustness metrics such as CFAD should be interpreted jointly with performance on matched commonsense controls: among the stronger baselines, gemini-3.1-pro-preview exhibits the smallest overall CFAD in QA, but it also attains the lowest overall CS-Acc among the frontier models evaluated (e.g., 87.67% in QA). Thus, its smaller degradation may partly reflect a lower commonsense baseline rather than clear absolute superiority in counterfactual visual grounding.

A plausible explanation is that, in our binary QA setting, the question text simultaneously contains a strong commonsense prior (e.g., "normal" expectations) and an explicit counterfactual claim grounded in the image. Models that are more conservative or refusal-prone under potential inconsistency may answer "no" (or abstain) more often even on commonsense controls, lowering CS-Acc and mechanically shrinking the measured CFAD.

Overall, CDH-Bench reveals a pervasive and fundamental vulnerability across current frontier VLMs, where even the strongest systems fail to prioritize visual evidence over entrenched priors.
The primary importance of this benchmark lies not merely in reporting lower accuracy scores, but in providing a rigorous diagnostic framework that aggregate metrics alone cannot offer. By decoupling prior-driven collapse from generic perception errors through its paired counterfactual–commonsense design, CDH-Bench enables a structured analysis of failure modes across diverse semantic dimensions. Furthermore, the introduction of specialized metrics such as CFAD, CCR, and RPD allows for unambiguous error attribution, revealing that model failures are often systematic normalizations toward the commonsense prototype rather than random noise. These findings establish CDH-Bench as a critical tool for identifying the hidden reliability gaps in multimodal systems and highlight the necessity of evaluating visual fidelity under explicit visual evidence–commonsense conflict to guide the development of more robust vision-language models.

5.4 Scale Study: Do Scale and Reasoning Style Mitigate CDH?

Motivation. A key question is whether model scale and reasoning style mitigate commonsense-driven hallucination under counterfactual visual evidence. A natural hypothesis is that larger models should achieve stronger visual discrimination, while reasoning-oriented variants may better preserve fidelity when visual evidence conflicts with learned priors. We test this hypothesis within the Qwen3-VL family.

Models and reporting. We study two Qwen3-VL series [16; 3]: Instruct and Thinking. Table 2 reports binary QA, multiple-choice QA, and overall results in a unified layout, and Figures 7 and 8 visualize the scaling trends for MC and QA, respectively.

Scale helps clearly in the Instruct series, but remains less stable in the Thinking series. Within the Instruct series, overall CF-Acc rises monotonically from 36.17% (2B) to 43.17% (4B), 43.50% (8B), and 50.83% (32B).
Thus, larger scale is generally beneficial for the Instruct models, although the gains are modest between some intermediate sizes. By contrast, the dense Thinking series is less monotonic at smaller and medium scales: overall CF-Acc changes from 53.17% (2B) to 51.17% (4B) and 51.50% (8B), before improving substantially to 58.41% at 32B. Therefore, scale alone does not guarantee a smooth reduction in CDH susceptibility, especially within the Thinking line.

Table 1: Main results on CDH-Bench. Higher CF-Acc and CS-Acc are better; lower CFAD, CCR, and RPD are better. To emphasize failure analysis rather than leaderboard ranking, selected notably weak results are marked in red. CCR is reported only for Multiple-choice QA in the main table. All values are percentages. Gray rows denote Overall results.

| Model | Cat. | Binary CF ↑ | Binary CS ↑ | Binary CFAD ↓ | Binary RPD ↓ | MC CF ↑ | MC CS ↑ | MC CFAD ↓ | MC CCR ↓ | MC RPD ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| gemini-3.1-pro-preview | Overall | 90.33 | 87.67 | -2.67 | -3.04 | 81.67 | 89.00 | 7.33 | 58.18 | 8.24 |
| | Attr. | 85.00 | 83.00 | -2.00 | -2.41 | 84.00 | 93.00 | 9.00 | 81.25 | 9.68 |
| | Count. | 96.00 | 87.00 | -9.00 | -10.34 | 79.00 | 80.00 | 1.00 | 42.86 | 1.25 |
| | Rel. | 90.00 | 93.00 | 3.00 | 3.23 | 82.00 | 94.00 | 12.00 | 55.56 | 12.77 |
| gemini-3.1-flash-lite-preview | Overall | 83.67 | 91.33 | 7.67 | 8.39 | 71.67 | 94.33 | 22.67 | 71.76 | 24.03 |
| | Attr. | 78.00 | 86.00 | 8.00 | 9.30 | 69.00 | 93.00 | 24.00 | 77.42 | 25.81 |
| | Count. | 94.00 | 93.00 | -1.00 | -1.08 | 69.00 | 94.00 | 25.00 | 58.06 | 26.60 |
| | Rel. | 79.00 | 95.00 | 16.00 | 16.84 | 77.00 | 96.00 | 19.00 | 82.61 | 19.79 |
| doubao-seed-1.8-251228 | Overall | 76.67 | 91.67 | 15.00 | 16.36 | 64.67 | 91.67 | 27.00 | 77.36 | 29.45 |
| | Attr. | 70.00 | 86.00 | 16.00 | 18.60 | 54.00 | 94.00 | 40.00 | 86.96 | 42.55 |
| | Count. | 80.00 | 93.00 | 13.00 | 13.98 | 71.00 | 88.00 | 17.00 | 65.52 | 19.32 |
| | Rel. | 80.00 | 96.00 | 16.00 | 16.67 | 69.00 | 93.00 | 24.00 | 74.19 | 25.81 |
| qwen3.5-plus | Overall | 75.87 | 93.98 | 18.11 | 19.27 | 74.37 | 95.55 | 21.18 | 81.69 | 22.17 |
| | Attr. | 75.51 | 88.00 | 12.49 | 14.19 | 75.51 | 93.94 | 18.43 | 95.83 | 19.62 |
| | Count. | 77.78 | 96.97 | 19.19 | 19.79 | 74.70 | 95.88 | 21.18 | 66.67 | 22.09 |
| | Rel. | 74.49 | 97.00 | 22.51 | 23.21 | 72.92 | 96.88 | 23.96 | 80.77 | 24.73 |
| kimi-k2.5 | Overall | 75.67 | 93.33 | 17.67 | 18.93 | 64.67 | 94.00 | 29.33 | 85.85 | 31.21 |
| | Attr. | 71.00 | 89.00 | 18.00 | 20.22 | 69.00 | 94.00 | 25.00 | 96.77 | 26.60 |
| | Count. | 76.00 | 99.00 | 23.00 | 23.23 | 47.00 | 93.00 | 46.00 | 79.25 | 49.46 |
| | Rel. | 80.00 | 92.00 | 12.00 | 13.04 | 78.00 | 95.00 | 17.00 | 86.36 | 17.89 |
| gemini-2.5-flash-all | Overall | 72.00 | 88.33 | 16.33 | 18.49 | 66.00 | 87.33 | 21.33 | 74.51 | 24.43 |
| | Attr. | 62.00 | 80.00 | 18.00 | 22.50 | 58.00 | 92.00 | 34.00 | 83.33 | 36.96 |
| | Count. | 81.00 | 92.00 | 11.00 | 11.96 | 65.00 | 76.00 | 11.00 | 74.29 | 14.47 |
| | Rel. | 73.00 | 93.00 | 20.00 | 21.51 | 75.00 | 94.00 | 19.00 | 60.00 | 20.21 |
| gpt-5.4-mini | Overall | 64.00 | 93.33 | 29.33 | 31.43 | 53.33 | 94.33 | 41.00 | 78.57 | 43.46 |
| | Attr. | 65.00 | 90.00 | 25.00 | 27.78 | 53.00 | 94.00 | 41.00 | 91.49 | 43.62 |
| | Count. | 65.00 | 95.00 | 30.00 | 31.58 | 43.00 | 91.00 | 48.00 | 66.67 | 52.75 |
| | Rel. | 62.00 | 95.00 | 33.00 | 34.74 | 64.00 | 98.00 | 34.00 | 80.56 | 34.69 |
| gpt-5.4 | Overall | 63.67 | 93.33 | 29.67 | 31.79 | 58.86 | 94.65 | 35.79 | 73.17 | 37.81 |
| | Attr. | 68.00 | 86.00 | 18.00 | 20.93 | 58.00 | 97.00 | 39.00 | 88.10 | 40.21 |
| | Count. | 61.00 | 97.00 | 36.00 | 37.11 | 48.00 | 90.00 | 42.00 | 53.85 | 46.67 |
| | Rel. | 62.00 | 97.00 | 35.00 | 36.08 | 70.71 | 96.97 | 26.26 | 86.21 | 27.08 |

Attr. = Attribute; Count. = Counting; Rel. = Relational.

Reasoning style improves accuracy more consistently than collapse-oriented error metrics. Although Thinking models consistently raise CF-Acc, their advantages on collapse-oriented metrics such as CCR and RPD are less uniform. For example, at 4B and 32B, Thinking slightly reduces MC CCR relative to Instruct, but at 8B it yields a higher CCR (89.33% vs. 78.74%). Similarly, improvements in RPD are clearer at some scales than others.
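As a quick check of this comparison, the per-scale differences between the Thinking and Instruct variants can be computed directly from the multiple-choice columns of Table 2. The Python sketch below is illustrative only: the numbers are transcribed from Table 2, while the dictionary layout and the helper name `thinking_minus_instruct` are assumptions introduced here and are not part of any released benchmark code.

```python
# Multiple-choice (MC) values transcribed from Table 2, in percent.
# The data layout and helper name are illustrative assumptions.
SCALES = ["2B", "4B", "8B", "32B"]

MC = {
    "Instruct": {
        "CF": [32.00, 34.33, 42.00, 46.00],
        "CCR": [73.53, 84.77, 78.74, 84.57],
        "RPD": [56.76, 61.57, 55.16, 50.36],
    },
    "Thinking": {
        "CF": [48.33, 47.33, 50.00, 60.46],
        "CCR": [77.42, 84.18, 89.33, 83.07],
        "RPD": [48.03, 50.18, 47.74, 35.49],
    },
}


def thinking_minus_instruct(metric):
    """Return the per-scale difference (Thinking - Instruct) in percentage points."""
    thinking = MC["Thinking"][metric]
    instruct = MC["Instruct"][metric]
    return {s: round(t - i, 2) for s, t, i in zip(SCALES, thinking, instruct)}


if __name__ == "__main__":
    for metric in ("CF", "CCR", "RPD"):
        print(metric, thinking_minus_instruct(metric))
```

With these values, the CF-Acc difference is positive at every scale, whereas the CCR difference is negative only at 4B and 32B and positive at 2B and 8B. The RPD difference is negative at all four scales but varies considerably in magnitude.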
This suggests that reasoning-oriented generation reliably improves success rates under conflict, but does not uniformly change the exact form of model failures.

Table 2: Results of the Qwen3-VL series on CDH-Bench. All values are reported as percentages. Higher CF-Acc and CS-Acc are better; lower CFAD, CCR, and RPD are better. We do not highlight best values; instead, selected notably weak results are marked in red. The Overall columns are shaded for readability.

| Model | Binary CF | Binary CS | Binary CFAD | Binary RPD | MC CF | MC CS | MC CFAD | MC CCR | MC RPD | Overall CF | Overall CS | Overall CFAD | Overall CCR | Overall RPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-2B-Instruct | 40.33 | 87.67 | 47.33 | 53.99 | 32.00 | 74.00 | 42.00 | 73.53 | 56.76 | 36.17 | 80.83 | 44.67 | 86.76 | 55.37 |
| Qwen3-VL-4B-Instruct | 52.00 | 91.67 | 39.67 | 43.27 | 34.33 | 89.33 | 55.00 | 84.77 | 61.57 | 43.17 | 90.50 | 47.33 | 92.39 | 52.42 |
| Qwen3-VL-8B-Instruct | 45.00 | 93.67 | 48.67 | 51.96 | 42.00 | 93.67 | 51.67 | 78.74 | 55.16 | 43.50 | 93.67 | 50.17 | 89.37 | 53.56 |
| Qwen3-VL-32B-Instruct | 55.67 | 93.33 | 37.67 | 40.36 | 46.00 | 92.67 | 46.67 | 84.57 | 50.36 | 50.83 | 93.00 | 42.17 | 92.28 | 45.36 |
| Qwen3-VL-2B-Thinking | 58.00 | 91.33 | 33.33 | 36.50 | 48.33 | 93.00 | 44.67 | 77.42 | 48.03 | 53.17 | 92.17 | 39.00 | 88.31 | 42.26 |
| Qwen3-VL-4B-Thinking | 55.00 | 92.67 | 37.67 | 40.65 | 47.33 | 95.00 | 47.67 | 84.18 | 50.18 | 51.17 | 93.83 | 42.67 | 92.09 | 45.41 |
| Qwen3-VL-8B-Thinking | 53.00 | 93.33 | 40.33 | 43.21 | 50.00 | 95.67 | 45.67 | 89.33 | 47.74 | 51.50 | 94.50 | 43.00 | 94.67 | 45.47 |
| Qwen3-VL-32B-Thinking | 56.37 | 94.57 | 38.20 | 40.40 | 60.46 | 93.72 | 33.26 | 83.07 | 35.49 | 58.41 | 94.15 | 35.73 | 91.53 | 37.94 |

Figure 7: Model size vs. performance metrics for the MC task on CDH-Bench. The plot shows the scaling behavior of CF-Acc, CFAD, and CS-Acc for Qwen3-VL Instruct and Thinking models.

Figure 8: Model size vs. performance metrics for the QA task on CDH-Bench. The plot shows the scaling behavior of CF-Acc, CFAD, and CS-Acc for Qwen3-VL Instruct and Thinking models.

6 Conclusion

We formalize commonsense-driven hallucination (CDH) and introduce CDH-Bench, a paired counterfactual-image benchmark that makes prior-driven normalization measurable and attributable. Across frontier VLMs, we observe large and systematic collapses from commonsense-consistent to counterfactual imagery, revealing a pervasive and fundamental vulnerability where even the strongest systems fail to prioritize visual evidence over entrenched priors. By decoupling prior-driven collapse from generic perception errors through specialized metrics such as CFAD, CCR, and RPD, we demonstrate that model failures are often systematic normalizations toward commonsense priors rather than random noise.

Our experiments show that CDH is not a uniform phenomenon. At the category level, counting anomalies are the most persistent source of degradation, especially in multiple-choice QA where explicit competition with commonsense alternatives triggers stronger collapse. At the subcategory level, the most difficult cases concentrate around biological part counting, physical-state anomalies, and causal reversal, indicating that both fine-grained perception and higher-level semantic reasoning contribute to failure. Our controlled Qwen3-VL study further suggests that reasoning-oriented variants generally improve robustness to CDH, especially at small and medium scales. However, larger models do not reliably eliminate prior-driven normalization, and the hardest subcategory-level bottlenecks remain visible even in large-scale systems.
Overall, CDH-Bench provides a focused diagnostic for visual fidelity under visual evidence–commonsense conflict and highlights a central limitation of current VLMs: they often know what is commonsense more reliably than they report what is actually present. By providing a rigorous framework for unambiguous error attribution, CDH-Bench serves as a critical tool for identifying hidden reliability gaps and guiding the development of multimodal systems that remain faithful to visual evidence even when it conflicts with learned priors.

References

[1] Manoj Acharya, Kushal Kafle, and Christopher Kanan. TallyQA: Answering complex counting questions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8076–8084, 2019.
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint, 2025.
[4] Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, and Yueting Zhuang. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10800–10809, 2020.
[5] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14375–14385, 2024.
[6] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[7] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
[8] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[9] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023.
[10] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
[11] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
[12] Khoi Pham, Kushal Kafle, Zhe Lin, Zhihong Ding, Scott Cohen, Quan Tran, and Abhinav Shrivastava. Learning to predict visual attributes in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13018–13028, 2021.
[13] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, 2018.
[14] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024.
[15] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
[16] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

A Case Study: Counting Anomaly (Pair 5)

A.1 Case Description

This case belongs to the Counting Anomalies category (Subcategory: Body Parts). The image pair compares a normal person showing hands (Commonsense) with an anomalous version where one hand has an extra finger (Counterfactual).

(a) Commonsense: Both hands have 5 fingers.
(b) Counterfactual: Left hand has 5 fingers, right hand has 6 fingers.
Figure 9: Images for Pair 5. The anomaly is an extra finger on the right hand (person's perspective).

A.2 Model Responses

Qwen3-VL-2B-Thinking

Commonsense Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✓ Correct
Response: So, let's check the image. The person has both hands visible. Each hand has 5 fingers. The left hand and right hand both have 5 fingers each. The question is asking if it shows both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers.
Wait, the question is phrased as "Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?" So we need to see if the image has both hands with 5 fingers each, and that it's not the case that left has 5 and right has 6. But in the image, both hands have 5 fingers each. So the statement is "both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers". Wait, the question is a bit confusing. Let's parse it. The question is: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers? So we need to check if the image has both hands with 5 fingers each, and that it'... [Truncated for length]

Counterfactual Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✗ Incorrect
Thinking Process: So, let's check the image. The person is showing both hands. Each hand has 5 fingers. Let's count: left hand (from the viewer's perspective, the right side of the image) has 5 fingers, and the right hand (left side of the image) also has 5 fingers. Wait, the question says "not the left hand with 5 fingers and right hand with 6 fingers". So we need to see if both hands have 5 fingers each. In the image, both hands are showing 5 fingers each. So the answer is yes.
Final Answer: yes

Commonsense Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✓ Correct
Thinking Process: Got it, let's look at the image. The person is showing their hands, and we need to count the fingers. Let's check each hand. The right hand: let's count. From the image, the right hand has 5 fingers? Wait, no, wait. Wait, the left hand: let's see. Wait, the right hand—wait, the person's right hand (from the viewer's perspective) is on the left side of the image. Wait, no, the person is facing forward. So the right hand (the one on the right side of the image) is the person's right hand. Let's count the fingers. The right hand: how many fingers? Let's see. The right hand has 5 fingers? Wait, no, wait. Wait, the left hand (the one on the left side of the image) is the person's left hand. Let's count the left hand: the left hand has 5 fingers? Wait, no, wait. Wait, the image shows both hands. Let's count each hand. The right hand (person's right) has 5 fingers? Wait, no, wait. Wait, the person's right hand: let's count. The right hand has 5 fingers? Wait, no, wait. Wait, the left hand (pe... [Truncated for length]
Final Answer: B

Counterfactual Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Response: Got it, let's look at the image. The person is showing their hands. Let's count the fingers. On the left hand, how many fingers? Let's see: thumb, index, middle, ring, pinky. Wait, the left hand in the image—wait, no, the right hand and left hand. Wait, the right hand: let's count. The right hand has fingers: thumb, index, middle, ring, pinky—so 5? Wait, no, wait. Wait, the left hand: let's check. Wait, the person's hands—wait, the right hand (from the viewer's perspective) has 5 fingers? Wait, no, wait. Wait, the image: the right hand (the one on the right side of the image) has 5 fingers? Wait, no, wait. Wait, the left hand (left side of the image) has 5 fingers? Wait, no, wait. Wait, let's count. The right hand: thumb, index, middle, ring, pinky—so 5. The left hand: thumb, index, middle, ring, pinky—wait, no, wait.
Wait, no, wait. Wait, the left hand in the image: let's count. The left hand (the one on the left side of the image) has 5 fingers? Wait, no, wait. Wait, no, wait. Wait, ... [Truncated for length]

Qwen3-VL-30B-A3B-Thinking

Commonsense Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✓ Correct
Thinking Process: Got it, let's check the image. The person has both hands raised. Let's count the fingers on each hand. Left hand: 5 fingers, right hand: 5 fingers. So the question is asking if both hands have 5 fingers each, not one hand with 6. So the answer is yes.
Final Answer: yes

Counterfactual Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✗ Incorrect
Thinking Process: So, let's check the image. The person has both hands raised. Let's count the fingers on each hand. Left hand: 5 fingers. Right hand: 5 fingers. So both have 5 each. The question is asking if it's showing both with 5, not left 5 and right 6. So the answer is yes.
Final Answer: yes

Commonsense Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Thinking Process: Got it, let's look at the image. The person's hands—each hand should have 5 fingers, but wait, no. Wait, the right hand (from the viewer's perspective, the person's left hand) and left hand. Wait, no, let's count. The right hand (the person's right hand) has 5 fingers, and the left hand (person's left) also has 5? Wait no, wait. Wait, the image shows both hands. Wait, no, the question is what's unusual. Wait, no, in the image, each hand has 5 fingers? Wait no, wait, no—wait, no, the correct answer is C? Wait no. Wait, no, let's check. Wait, the options: A says left has 5, right has 6. B both 5. C both 6. D left 6, right 4. Wait, no, in the image, the hands—wait, no, the person's hands. Wait, no, the image shows that each hand has 5 fingers? Wait no, no, wait, no. Wait, no, the correct answer is C? Wait, no, wait. Wait, no, the image—wait, no, the person's hands. Wait, no, the answer is C? Wait, no, let's count. The right hand (the person's right hand) has 5 fingers, left hand also 5. B... [Truncated for length]
Final Answer: C

Counterfactual Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Thinking Process: Got it, let's look at the image. The person's hands—each hand should have 5 fingers, but wait, no. Wait, the left hand (from the viewer's perspective, the person's right hand) and right hand. Wait, no, let's count. The left hand (the person's left hand) has 6 fingers? Wait, no. Wait, the image shows both hands. Let's check: each hand—wait, the left hand (the person's left) has 6 fingers? Wait, no, the right hand (person's right) and left hand. Wait, the correct answer: the left hand (person's left) has 6 fingers? Wait, no. Wait, the options: A says left hand 5, right 6. Let's count. The right hand (the person's right hand) has 6 fingers? Wait, no. Wait, the image—each hand. Wait, the person's left hand (the one on the right side of the image) has 6 fingers? Wait, no. Wait, the correct answer is C? No. Wait, no. Wait, the hands: let's count. The left hand (the person's left) has 6 fingers? Wait, no. Wait, the image shows both hands. Let's count each hand. The right hand (person's right)... [Truncated for length]
Final Answer: C

Qwen3-VL-32B-Thinking

Commonsense Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✓ Correct
Thinking Process: So, let's check the image. The person has both hands raised. Let's count the fingers on each hand. Left hand (from the viewer's perspective, the right side of the image) – wait, no, the person's left and right. Wait, the person's right hand (our left) and left hand (our right). Let's count: right hand (person's right) has 5 fingers: thumb, index, middle, ring, pinky. Left hand (person's left) also has 5 fingers. So both hands have 5 fingers each. The question is asking if it's both hands with 5 each, not left with 5 and right with 6. So yes, both have 5. So the answer is yes.
Final Answer: yes

Counterfactual Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✗ Incorrect
Thinking Process: So, let's check the image. The person is showing both hands. Let's count the fingers on each hand. Left hand (from the viewer's perspective, the right side of the image) – wait, no, the person's left and right. Wait, the left hand of the person (our right) and right hand of the person (our left). Let's count: right hand (person's right) has 5 fingers: thumb, index, middle, ring, pinky. Left hand (person's left) also has 5 fingers. So both hands have 5 fingers each. The question is asking if it's both hands with 5 each, not left with 5 and right with 6. So yes, both have 5. So the answer is yes.
Final Answer: yes

Commonsense Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Thinking Process: Got it, let's look at the image. The person is showing both hands. Let's count the fingers on each hand. First, the right hand (from the viewer's perspective, which is the person's left hand? Wait, no. Wait, the person's right hand is on the left side of the image. Let's clarify: when the person raises their hands, the left hand (of the person) is on the right side of the image, and the right hand (of the person) is on the left side of the image. Wait, let's count. Let's check the right hand (the one on the left side of the image). How many fingers? Let's see: the right hand (person's right) has 5 fingers? Wait, no. Wait, the right hand (person's right) in the image: let's count. The right hand (left side of image) has 5 fingers? Wait, no. Wait, the left hand (person's left) is on the right side of the image. Let's count each hand. Looking at the image: the right hand (from the person's perspective, so the one on the left side of the image) has 5 fingers? Wait, no. Wait, the right... [Truncated for length]
Final Answer: A

Counterfactual Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✓ Correct
Thinking Process: Got it, let's look at the image. The person is showing both hands. Let's count the fingers on each hand. First, the left hand (from the viewer's perspective, which is the person's right hand? Wait, no. Wait, the person's left hand is on the left side of the image. Let's check: the left hand (person's left) has fingers. Let's count. The left hand (in the image, left side) has 5 fingers? Wait, no. Wait, the right hand (person's right) is on the right side of the image. Let's count each hand.
Looking at the left hand (the one on the left side of the image, which is the person's left hand): let's count. The fingers are: thumb, index, middle, ring, pinky. Wait, but maybe one of them is missing or extra. Wait, the right hand (person's right) in the image: let's count. The right hand (image right) has 6 fingers? Wait, no. Wait, the left hand (image left) has 5 fingers, and the right hand (image right) has 6? Let's check again. Wait, the left hand (the one on the left side of the image) ha... [Truncated for length]
Final Answer: A

Qwen3-VL-4B-Thinking

Commonsense Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✓ Correct
Thinking Process: So, let's look at the image. The person is showing both hands. Let's count the fingers on each hand. Both hands have 5 fingers each. The left hand (from the viewer's perspective, the person's right hand) and the right hand (person's left hand) are both extended with 5 fingers. So the question is asking if it's both hands with 5 fingers each, not one with 5 and the other with 6. The answer should be yes.
Final Answer: yes

Counterfactual Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✗ Incorrect
Thinking Process: So, let's look at the image. The person is showing both hands. Let's count the fingers on each hand. Left hand: 5 fingers. Right hand: 5 fingers. Wait, the question is asking if it's both hands with 5 fingers each, not left 5 and right 6. So check each hand. Both hands have 5 fingers. So the answer is yes.
Final Answer: yes

Commonsense Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Thinking Process: Got it, let's look at the image. The person is showing both hands. Let's count the fingers on each hand. The left hand: let's see, the left hand has 5 fingers. The right hand: wait, the right hand—wait, no, maybe I mixed up left and right. Wait, the person's right hand (from the viewer's perspective, the right side of the image) is the one on the right. Wait, no, the person's left hand is on the left side of the image, right hand on the right. Let's count. Left hand (image left): 5 fingers. Right hand (image right): let's count. Wait, the right hand in the image—wait, the person's right hand is the one on the right side of the image. Let's check. Wait, the right hand (the one on the right in the image) has 6 fingers? Wait, no. Wait, maybe the left hand has 6? Wait, no. Wait, let's look again. Wait, the problem says "what's unusual". Let's count each hand. Left hand (the one on the left side of the image, which is the person's left hand): how many fingers? Let's see, the left hand in th... [Truncated for length]
Final Answer: A

Counterfactual Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Thinking Process: Got it, let's look at the image. The person is showing both hands. Let's count the fingers on each hand. The left hand: let's see, thumb, index, middle, ring, pinky—5 fingers. The right hand: same, thumb, index, middle, ring, pinky—5 fingers? Wait, no, wait. Wait, maybe I miscounted. Wait, the options are A to D. Let's check again. Wait, the question says what's unusual. Wait, maybe one hand has an extra finger? Wait, no, in the image, both hands look like they have 5 fingers each.
But wait, option B says both have 5. But that's normal. Wait, maybe I'm missing something. Wait, no, the problem is that maybe one hand has 6? Wait, let's count again. Left hand: 5 fingers. Right hand: 5 fingers. But that's normal. Wait, but the options include B as both have 5. But maybe the question is tricking me. Wait, no, wait. Wait, maybe the person is holding their hands in a way that one hand has an extra finger? Wait, no, looking at the image, both hands have 5 fingers. So why is the question asking... [Truncated for length]
Final Answer: B

Qwen3-VL-8B-Thinking

Commonsense Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✓ Correct
Thinking Process: So, let's check the image. The person has both hands raised. Let's count the fingers on each hand. Left hand: 5 fingers. Right hand: 5 fingers. So neither hand has 6 fingers. The question is asking if it's both hands with 5 fingers each, not left 5 and right 6. So the answer should be yes.
Final Answer: yes

Counterfactual Images in the Binary QA Setting
Question: Does the image show both hands with 5 fingers each and not the left hand with 5 fingers and right hand with 6 fingers?
Status: ✗ Incorrect
Thinking Process: So, let's check the image. The person has both hands raised. Let's count the fingers on each hand. Left hand: 5 fingers. Right hand: 5 fingers. So both have 5 each. The question is asking if it's showing both with 5, not left 5 and right 6. So the answer is yes.
Final Answer: yes

Commonsense Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Response: Got it, let's look at the image. The person is holding up both hands. Let's count the fingers. Wait, the right hand—wait, no, let's check each hand. Wait, the left hand: how many fingers? Let's see. The right hand: maybe 6? Wait, no, wait. Wait, the image shows both hands. Wait, the right hand (from the viewer's perspective, the person's right hand) has 5 fingers? Wait, no, wait. Wait, the person's left hand—wait, no, let's count. Wait, the right hand (the one on the left side of the image) has 5 fingers? Wait, no, maybe the left hand (on the right side of the image) has 6? Wait, no, let's check again. Wait, the options: A says left has 5, right has 6. B both 5. C both 6. D left 6, right 4. Wait, in the image, let's count. The right hand (person's right, which is left side of image) has 5 fingers? Wait, no, wait. Wait, the person's left hand (right side of image) has 6 fingers? Wait, no, maybe I'm miscounting. Wait, no, actually, looking at the image, the right hand (person's right) ha... [Truncated for length]

Counterfactual Images in the Multiple-Choice QA Setting
Question: What is unusual about the hands in this image?
A. Left hand has 5 fingers and right hand has 6 fingers
B. Both hands have 5 fingers
C. Both hands have 6 fingers
D. Left hand has 6 fingers and right hand has 4 fingers
Status: ✗ Incorrect
Thinking Process: Got it, let's look at the image. The person is holding up both hands. Let's count the fingers on each hand. The right hand: let's see, thumb, index, middle, ring, pinky—5 fingers. Left hand: same, thumb, index, middle, ring, pinky—5 fingers. Wait, but the options say something about 6 or 4. Wait, maybe I miscounted. Wait, no, the question is what's unusual. Wait, the options: A says left 5, right 6. B both 5. C both 6. D left 6, right 4. But in the image, both hands have 5 fingers. So the unusual part would be if they didn't, but the question is asking what's unusual. Wait, but the options are the choices. Wait, the problem is to pick which option is correct. Wait, the image shows both hands with 5 fingers. So option B is "Both hands have 5 fingers". But is that unusual? Wait, no, normally hands have 5 fingers. Wait, but maybe the question is tricking us. Wait, no, the options are the possible answers. Wait, the question is "What is unusual about the hands in this image?" So if both ha... [Truncated for length]
Final Answer: B
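Per-pair outcomes such as those listed in this appendix are the raw material for the aggregate metrics. The Python sketch below shows one way such outcomes could be aggregated. It rests on assumptions that are not stated in this section: CFAD is taken as CS-Acc minus CF-Acc and RPD as CFAD divided by CS-Acc (both consistent with the reported values, e.g. 74.00 − 32.00 = 42.00 and 42.00 / 74.00 ≈ 56.76% for Qwen3-VL-2B-Instruct in the multiple-choice setting), and CCR is taken as the share of counterfactual errors in which the model selects the commonsense-consistent option. The `PairOutcome` record and the `summarize` function are hypothetical names, and the toy outcomes are illustrative rather than benchmark data.

```python
from dataclasses import dataclass


@dataclass
class PairOutcome:
    """Result of one image pair for one model (hypothetical record layout)."""
    cs_correct: bool             # answer on the commonsense image is correct
    cf_correct: bool             # answer on the counterfactual image is correct
    cf_picked_commonsense: bool  # counterfactual answer is the commonsense-consistent option


def summarize(outcomes):
    """Aggregate pair outcomes into percentages under the assumed metric definitions."""
    if not outcomes:
        raise ValueError("at least one pair outcome is required")
    n = len(outcomes)
    cs_acc = 100.0 * sum(o.cs_correct for o in outcomes) / n
    cf_acc = 100.0 * sum(o.cf_correct for o in outcomes) / n
    cfad = cs_acc - cf_acc  # assumed: accuracy drop from CS to CF
    rpd = 100.0 * cfad / cs_acc if cs_acc else float("nan")  # assumed: drop relative to CS-Acc
    cf_errors = [o for o in outcomes if not o.cf_correct]
    # Assumed: share of counterfactual errors that fall back to the commonsense option.
    ccr = (
        100.0 * sum(o.cf_picked_commonsense for o in cf_errors) / len(cf_errors)
        if cf_errors
        else float("nan")
    )
    return {"CS-Acc": cs_acc, "CF-Acc": cf_acc, "CFAD": cfad, "RPD": rpd, "CCR": ccr}


if __name__ == "__main__":
    # Toy outcomes for four pairs (illustrative only).
    toy = [
        PairOutcome(cs_correct=True, cf_correct=True, cf_picked_commonsense=False),
        PairOutcome(cs_correct=True, cf_correct=False, cf_picked_commonsense=True),
        PairOutcome(cs_correct=True, cf_correct=False, cf_picked_commonsense=False),
        PairOutcome(cs_correct=False, cf_correct=False, cf_picked_commonsense=True),
    ]
    for name, value in summarize(toy).items():
        print(f"{name}: {value:.2f}")
```

Under this layout, Pair 5 for Qwen3-VL-4B-Thinking in the multiple-choice setting would be recorded as `PairOutcome(cs_correct=False, cf_correct=False, cf_picked_commonsense=True)`, since the model answers B ("Both hands have 5 fingers") on the counterfactual image.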
