CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models
We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.
💡 Research Summary
CompareBench is a newly introduced benchmark designed to rigorously evaluate visual comparison reasoning in vision‑language models (VLMs), a core cognitive skill that has received little systematic attention despite its importance for everyday perception and higher‑level reasoning. The benchmark consists of 1,000 question‑answer pairs divided into four complementary sub‑tasks: quantity comparison (600 items), temporal ordering (100 items), geometric property comparison (200 items), and spatial relation reasoning (100 items). These tasks are built on two auxiliary resources created by the authors: TallyBench, a collection of 2,000 real‑world images each paired with a counting question covering roughly 50 fine‑grained object categories (animals, plants, people, electronics, etc.); and HistCaps, a curated set of 515 historical photographs annotated with bilingual (English–Chinese) captions and explicit temporal tags spanning several centuries.
From these resources, CompareBench constructs controlled yet diverse evaluation scenarios. In the quantity sub‑benchmark (CompareTallyBench), four images are arranged in a 1600 × 1600 grid and the model must select the image containing the greatest number of instances of a specified object class. The temporal sub‑benchmark (CompareTemporalBench) presents four historically tagged images in a 1920 × 1440 grid and asks which scene occurred earliest. The geometric sub‑benchmark (CompareGeometryBench) shows a single image with four labeled objects (A–D) and asks for the longest, shortest, thickest, widest, or widest‑diameter object, requiring pure visual measurement without reliance on semantic cues. Finally, the spatial sub‑benchmark (CompareSpatialBench) asks about depth (closest to the camera) or vertical height (highest above ground) among four labeled points or objects. All tasks use a unified instruction template that forces the model to output a single choice (A–D) without any additional text, thereby eliminating confounding factors such as free‑form explanations.
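The strict single-letter answer format described above can be sketched as follows. This is a hypothetical rendering, not the paper's released code: the exact template wording and parsing rules in CompareBench may differ, and `PROMPT_TEMPLATE` and `extract_choice` are illustrative names.

```python
import re

# Assumed wording of the unified instruction template; the benchmark's
# actual phrasing is not reproduced in this summary.
PROMPT_TEMPLATE = (
    "The four images are labeled A, B, C, and D. {question} "
    "Respond with exactly one letter: A, B, C, or D. Output nothing else."
)

def extract_choice(model_output: str):
    """Accept only a bare single-letter answer (A-D); anything else,
    such as a free-form explanation, is rejected as invalid."""
    answer = model_output.strip().upper()
    return answer if re.fullmatch(r"[ABCD]", answer) else None
```

Enforcing a bare letter removes the need to grade free-form text, so a response like "The answer is B" would simply be scored as invalid rather than parsed heuristically.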
The authors evaluate a broad spectrum of VLMs, including closed‑source APIs (OpenAI’s GPT‑4o, GPT‑5 series, Google Gemini 2.5 variants, Anthropic Claude models) and open‑source models (Qwen2.5‑VL and Qwen3‑VL series ranging from 3B to 72B parameters). Results reveal clear scaling trends: larger models generally achieve higher overall accuracy. However, critical blind spots emerge. While the best-performing models (e.g., Gemini 2.5 Pro, OpenAI o4‑mini) reach 80–90% accuracy on the quantity sub‑benchmark, their performance on temporal and spatial tasks hovers around 60–70% and often falls below 50% for smaller models. Even the most advanced GPT‑5 variants, which attain >78% overall, still struggle with temporal ordering (≈70% at best) and spatial reasoning (≈65% at best). Geometric comparison is relatively easier, with top models achieving 80–86% accuracy, yet still far from human‑level performance (~95%).
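The per-subtask and overall accuracies discussed above can be computed with a minimal scorer like the sketch below; the record layout and function name are assumptions for illustration, not the authors' evaluation harness.

```python
from collections import defaultdict

def per_task_accuracy(records):
    """Score benchmark results broken down by sub-task.

    records: iterable of (task, predicted_letter, gold_letter) triples,
    e.g. ("quantity", "A", "A"). Returns {task: accuracy} plus an
    "overall" entry, mirroring how benchmark tables typically report both.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, gold in records:
        total[task] += 1
        correct[task] += int(pred == gold)  # invalid answers (None) count as wrong
    scores = {task: correct[task] / total[task] for task in total}
    n = sum(total.values())
    scores["overall"] = sum(correct.values()) / n if n else 0.0
    return scores
```

Because the answer format is a single letter, exact string comparison suffices; a rejected (non-letter) response scores zero automatically.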
Error analysis shows systematic patterns: models frequently conflate thickness with length, misinterpret reflections or shadows as separate objects, and rely on superficial visual cues rather than the provided temporal metadata when answering historical ordering questions. In spatial tasks, models often misjudge depth cues, selecting objects that appear larger rather than those truly closer to the camera. These findings indicate that current VLMs lack robust internal representations of time and 3‑D space, despite impressive gains in captioning and open‑ended VQA.
The paper positions CompareBench as a complementary diagnostic tool to existing benchmarks such as VQA, GQA, CLEVR, and holistic suites like MMBench or MM‑Vet, which either focus on recognition, synthetic reasoning, or broad multimodal abilities but do not isolate comparative reasoning. By releasing the full dataset, code, and prompt templates, the authors enable the community to benchmark future models, explore training strategies (e.g., contrastive pre‑training on comparison pairs, multi‑stage alignment for temporal cues), and develop specialized chain‑of‑thought prompting for comparison tasks. The authors suggest future directions including expanding the temporal span of HistCaps, incorporating depth maps or 3‑D reconstructions to strengthen spatial reasoning, and integrating explicit comparison objectives into the loss functions of VLMs.
In summary, CompareBench fills a critical gap in multimodal evaluation by providing a real‑world, fine‑grained, and systematically controlled benchmark for visual comparison reasoning. The empirical study demonstrates that, while scaling improves overall performance, current VLMs remain markedly deficient in temporal and spatial comparison—areas essential for scientific, educational, and decision‑making applications. The benchmark thus offers a concrete target for the next generation of vision‑language systems to achieve more human‑like comparative cognition.