RealCQA-V2: A Diagnostic Benchmark for Structured Visual Entailment over Scientific Charts
Multimodal reasoning models often produce fluent answers supported by seemingly coherent rationales, yet existing benchmarks evaluate only final-answer correctness: they do not support atomic entailment verification of intermediate steps, especially for visual compositional logic. This limitation is particularly acute in scientific chart understanding, where answers depend on deterministically grounded visual semantics such as axes, legends, and quantitative relations. We introduce RealCQA-V2, a large-scale benchmark that reformulates chart question answering as Visual Premise Proving (VPP): a structured logical entailment task over chart-grounded visual predicates. Each question is deconstructed into manually curated, atomic premises grounded in chart elements (axes, legends, marks, and quantitative relations), yielding executable, compositional reasoning chains rather than free-form textual rationales and enabling verification both at the level of individual visual statements and across complete reasoning sequences. Beyond traditional VQA accuracy, we define chain-level metrics that measure full logical validity (AccVPP) and partial reasoning progress within failed chains (DCP). Baseline evaluations across representative LVLMs reveal a consistent local-global reasoning gap: models often verify many individual premises correctly while failing to preserve coherence across the full chain. RealCQA-V2 thus establishes a reproducible benchmark for structured visual entailment over real scientific charts and enables rigorous diagnosis of multimodal reasoning beyond answer-only evaluation.
💡 Research Summary
RealCQA‑V2 introduces a novel diagnostic benchmark that reframes scientific chart question answering as a Visual Premise Proving (VPP) task. Instead of evaluating only the final answer, the authors decompose each chart‑based question into a sequence of manually curated atomic premises that are directly grounded in chart elements such as axes, legends, tick marks, and data marks. These premises capture four distinct reasoning levels: Structural Premises (verifying chart grammar), Data Premises (retrieving exact numeric values), Reasoning Premises (comparisons and relational judgments), and Math Premises (derived quantities like differences or ratios). For each premise the benchmark provides three aligned representations—natural‑language text, a first‑order logic (FOL) predicate, and an abstract syntax tree (AST)—enabling evaluation across language models, symbolic engines, and graph‑based systems.
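A premise with three aligned representations can be pictured as a small record. This is a minimal sketch, assuming an illustrative schema; the field names and example values are not the benchmark's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Premise:
    """One atomic, chart-grounded premise (illustrative schema)."""
    level: str   # "structural" | "data" | "reasoning" | "math"
    text: str    # natural-language statement
    fol: str     # first-order-logic predicate as a string
    ast: tuple   # abstract syntax tree as nested tuples
    label: bool  # gold entailment: does the chart support it?

# Hypothetical data-level premise for a bar chart.
p = Premise(
    level="data",
    text="The bar for 2020 has height 4.2.",
    fol="value(bar_2020, 4.2)",
    ast=("value", "bar_2020", 4.2),
    label=True,
)
```

The three fields mirror the paper's three consumers: the `text` goes to a language model, the `fol` string to a symbolic engine, and the `ast` to a graph-based matcher.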
The dataset is built on 28 K real scientific figures harvested from PubMed Central and richly annotated via the ChartInfo pipeline. From these figures, 1.7 M questions are generated, yielding over 5 M premise‑conclusion pairs with an average chain depth of 9–11 steps. Premise generation follows a deterministic pipeline (Structure → Data → Reasoning → Math) and is first drafted by GPT‑4o, then rigorously verified by human annotators to ensure logical correctness and exact alignment with the underlying chart annotations. The resulting chains are also expressed as directed acyclic graphs (G_Q), exposing explicit dependency edges between premises.
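The dependency graph G_Q and the deterministic Structure → Data → Reasoning → Math ordering can be sketched with a toy question. The premise IDs and edges below are made up for illustration; any valid evaluation order is a topological order of the graph.

```python
from graphlib import TopologicalSorter

# Toy G_Q for one question: each premise maps to the set of
# premises it depends on (IDs are illustrative, not the dataset's).
g_q = {
    "s1": set(),          # structural: both axes and their labels exist
    "d1": {"s1"},         # data: read value v1 (requires the axes)
    "d2": {"s1"},         # data: read value v2
    "r1": {"d1", "d2"},   # reasoning: v1 > v2
    "m1": {"d1", "d2"},   # math: compute v1 - v2
}

# A valid proof order must respect every dependency edge.
order = list(TopologicalSorter(g_q).static_order())
```

`graphlib.TopologicalSorter` (Python 3.9+) also raises `CycleError` on cyclic input, which doubles as a cheap acyclicity check for a candidate chain.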
Two chain‑level metrics are introduced. AccVPP measures full‑chain validity: a model receives a positive score only if every premise in the chain is correctly entailed by the chart. DCP (Depth of Correct Premises) quantifies partial progress by reporting the proportion of premises that a model gets right within a failed chain. These metrics expose a “local‑global reasoning gap”: many state‑of‑the‑art large vision‑language models (LVLMs) such as GPT‑4o, Gemini‑2.5, and InternVL‑3 achieve high premise‑level accuracies (80‑95 %) on structural and data premises but drop to near‑zero AccVPP because they fail to maintain logical consistency across reasoning and math premises. For example, models often correctly identify axis labels and retrieve numeric values yet mis‑apply comparisons or arithmetic, leading to incoherent final conclusions.
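The two metrics can be sketched directly from their descriptions. This is one reading of the definitions, assuming each chain is scored as a list of per-premise correctness booleans; the handling of the no-failed-chains edge case is an assumption, not taken from the paper.

```python
def acc_vpp(chains):
    """AccVPP: fraction of chains in which every premise is correct."""
    return sum(all(c) for c in chains) / len(chains)

def dcp(chains):
    """DCP: mean proportion of correct premises within failed chains
    (chains containing at least one wrong judgment)."""
    failed = [c for c in chains if not all(c)]
    if not failed:
        return 1.0  # assumed convention when no chain fails
    return sum(sum(c) / len(c) for c in failed) / len(failed)

# Each inner list holds per-premise correctness for one chain.
chains = [
    [True, True, True],          # fully valid chain
    [True, True, False, False],  # fails at the reasoning/math steps
    [True, False, True],
]
print(acc_vpp(chains))  # 1 of 3 chains fully correct
print(dcp(chains))      # mean of 2/4 and 2/3 over the failed chains
```

The example reproduces the local-global gap in miniature: two-thirds of all premises are judged correctly, yet only one chain of three survives end to end.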
The authors also explore how VPP can be used for training. By providing premise‑level supervision, models can be fine‑tuned to predict binary entailment judgments for each premise, which in preliminary experiments improves both final answer accuracy and chain consistency. Moreover, the multi‑representation design allows hybrid approaches: a language model can generate the natural‑language premise, a symbolic reasoner can verify the corresponding FOL predicate, and a graph‑matching algorithm can check the AST structure, facilitating research on neuro‑symbolic integration.
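Premise-level supervision amounts to turning each chain into binary entailment training pairs. The prompt template and label encoding below are illustrative assumptions, not the authors' actual fine-tuning setup.

```python
def premise_supervision(chart_id, premises):
    """Yield (prompt, target) pairs for binary entailment fine-tuning.

    `premises` is a list of dicts with a natural-language 'text' and a
    gold boolean 'label'; the prompt format is a made-up example.
    """
    for p in premises:
        prompt = (
            f"Chart: {chart_id}\n"
            f"Premise: {p['text']}\n"
            "Is this premise entailed by the chart? Answer yes or no."
        )
        target = "yes" if p["label"] else "no"
        yield prompt, target

# Hypothetical two-premise chain for a figure "fig_042".
premises = [
    {"text": "The x-axis is labeled 'Year'.", "label": True},
    {"text": "The 2020 bar is taller than the 2019 bar.", "label": False},
]
pairs = list(premise_supervision("fig_042", premises))
```

Because every premise carries its own gold label, a model is penalized for each broken link in the chain rather than only for a wrong final answer.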
In summary, RealCQA‑V2 offers the first large‑scale, real‑world chart benchmark that moves beyond answer‑only evaluation to a rigorous, verifiable visual entailment framework. By grounding every intermediate inference in deterministic chart semantics and providing fine‑grained metrics, it enables precise diagnosis of where multimodal models succeed (local perception and data extraction) and where they fail (global logical composition). The benchmark is poised to drive the next generation of multimodal systems that combine fluent language generation with faithful, step‑by‑step visual reasoning.