V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Large Vision-Language Models (LVLMs) have recently made significant strides in video understanding. Nevertheless, existing video benchmarks rely predominantly on text prompts for evaluation, which often require complex referential language and in turn diminish both the accuracy and efficiency of human-model interaction. To address this limitation, we propose V2P-Bench, a robust and comprehensive benchmark for evaluating the ability of LVLMs to understand Video Visual Prompts in human-model interaction scenarios. V2P-Bench consists of 980 videos and 1,172 well-structured, high-quality QA pairs, each paired with manually annotated visual prompt frames. The benchmark spans three main tasks and twelve dimensions, enabling fine-grained, instance-level evaluation. Through an in-depth analysis of current LVLMs, we identify several key findings: 1) In interactive scenarios, visual prompts are both more model-friendly and more user-friendly than text prompts, leading to significantly improved model performance and a better user experience. 2) Models show reasonable zero-shot understanding of visual prompts but struggle with spatiotemporal reasoning: even o1 achieves only 71.8%, far below the human expert score of 88.3%, while most open-source models score below 60%. 3) LVLMs exhibit a pervasive "Hack Phenomenon" in video question answering, which becomes more pronounced as video length increases and frame-sampling density decreases, artificially inflating performance scores. We anticipate that V2P-Bench will not only shed light on these challenges but also serve as a foundational tool for advancing human-model interaction and improving the evaluation of video understanding.
💡 Research Summary
The paper introduces V2P‑Bench, a novel benchmark designed to evaluate large vision‑language models (LVLMs) on video understanding tasks that involve visual prompts rather than traditional text‑only queries. Recognizing that text prompts often require complex referential language and can lead to misalignment between user intent and model interpretation—especially in multi‑object or temporally intricate scenes—the authors propose a human‑model interaction framework where a single frame is annotated with a visual cue (e.g., rectangle, mask, ellipse, scribble, point, arrow, or Set‑of‑Mark (“SoM”) labels) that directly points to the target of interest.
The dataset aggregates 980 videos drawn from twelve existing video collections, reorganized into twenty semantic categories and three duration tiers (short, medium, long). For each of the 1,172 high‑quality question‑answer (QA) pairs, annotators manually place one visual prompt on a specific frame, ensuring that the prompt precisely encodes the spatial and temporal focus required for the question. The benchmark defines three overarching tasks—Basic Perception, Temporal Understanding, and High‑Level Reasoning—further divided into twelve orthogonal dimensions such as Object Attribute, Human Attribute, Object Direction, Forward/Reverse Temporal, Action Sequence, Spatial Relationship, General Counting, Causal Relationship, Plot Understanding, and Counterfactual Inference. This fine‑grained taxonomy enables a nuanced assessment of a model’s ability to reason about static attributes, dynamic changes, and abstract logical relations.
To guarantee data integrity, the authors employ a two‑stage filtering pipeline. First, a “blind LLM” stage uses powerful closed‑source models (GPT‑4o and Gemini‑2.5‑Pro) to discard any QA that can be answered correctly without viewing the video, thereby eliminating commonsense shortcuts. Second, rule‑based and manual reviews remove biased answer distributions and ensure balanced multiple‑choice options (approximately 25% each).
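The two‑stage pipeline could be sketched roughly as follows. This is a minimal illustration, not the paper’s code: the function names, the QA‑pair dictionary layout, and the `answer_fn` stub (standing in for a call to a closed‑source model) are all assumptions.

```python
from collections import Counter


def blind_llm_filter(qa_pairs, answer_fn, n_trials=3):
    """Stage 1 (sketch): discard QA pairs that a text-only ("blind") model
    answers correctly every time without seeing the video. `answer_fn` is a
    hypothetical stand-in for a call to a model such as GPT-4o."""
    kept = []
    for qa in qa_pairs:
        guesses = [answer_fn(qa["question"], qa["options"])
                   for _ in range(n_trials)]
        # Keep only questions the blind model cannot reliably solve,
        # i.e. ones whose answers genuinely require watching the video.
        if guesses.count(qa["answer"]) < n_trials:
            kept.append(qa)
    return kept


def check_option_balance(qa_pairs, tolerance=0.10):
    """Stage 2 (sketch): flag whether the answer-letter distribution is
    roughly uniform (~25% per option for 4-way multiple choice)."""
    counts = Counter(qa["answer"] for qa in qa_pairs)
    total = sum(counts.values())
    return {letter: abs(counts[letter] / total - 0.25) <= tolerance
            for letter in "ABCD"}
```

In practice the stage-1 check would be run per model and combined, and stage 2 would also include the manual review the paper describes; the sketch only captures the mechanical part.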
Evaluation covers fifteen LVLMs: three closed‑source (o1, GPT‑4o, Gemini‑2.5‑Pro) and twelve open‑source models (including LLaVA‑OneVision, InternVL3‑8B, mPLUG‑Owl3‑7B, LLaVA‑Video, MiniCPM‑V, Qwen2.5‑VL, MiMo‑VL, LLaVA‑NeXT variants). For each model, video frames are sampled according to the model’s context length (typically 128 frames, 64 for the largest closed‑source models). The visual prompt frame is appended after the video stream, mimicking a realistic interaction where a user first shows the video and then highlights the region of interest.
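The frame budgeting and prompt‑frame placement described above can be illustrated with a short sketch. The uniform sampling strategy and both function names are assumptions for illustration; the paper specifies only the frame budgets and that the prompt frame follows the video stream.

```python
def sample_frame_indices(total_frames, max_frames=128):
    """Uniformly sample up to `max_frames` frame indices from a video.
    128 is the typical budget reported; 64 for the largest closed-source
    models."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


def build_model_input(video_frames, prompt_frame, max_frames=128):
    """Lay out the evaluation input: sampled video frames first, then the
    annotated visual-prompt frame appended at the end, mirroring a user
    who shows the video and then highlights the region of interest."""
    indices = sample_frame_indices(len(video_frames), max_frames)
    return [video_frames[i] for i in indices] + [prompt_frame]
```

Appending the prompt frame last, rather than interleaving it, keeps the video stream untouched and makes the highlighted region the most recent visual context the model sees.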
Results reveal several key findings. First, even the strongest model, o1, attains only 71.8% overall accuracy, well below the human expert ceiling of 88.3%. Performance gaps are especially pronounced in temporal dimensions such as Object Direction and Action Sequence, indicating that current LVLMs still struggle with fine‑grained spatiotemporal reasoning. Most open‑source models fall below 60% accuracy, underscoring a substantial gap between research prototypes and commercial systems.
Second, the authors identify a “Hack Phenomenon” whereby longer videos and lower frame‑sampling densities enable models to exploit superficial visual patterns or statistical regularities rather than genuine understanding. This leads to artificially inflated scores that do not reflect true comprehension. The phenomenon becomes more severe as video length increases, highlighting a critical weakness in existing evaluation protocols that rely on sparse sampling.
Third, a user‑experience study demonstrates that visual prompts dramatically improve interaction efficiency: participants generate queries faster, achieve higher answer correctness, and report greater satisfaction compared with text‑only prompts. The visual cue eliminates the need for elaborate linguistic descriptions, aligning more closely with natural human cognition.
In summary, V2P‑Bench establishes a comprehensive, high‑quality benchmark that foregrounds visual prompting as a more user‑friendly and model‑friendly modality for video‑language tasks. It exposes current LVLMs’ limitations in spatiotemporal understanding, warns of evaluation artifacts caused by sparse sampling, and provides a solid foundation for future research aimed at improving visual prompt encoding, frame selection strategies, and robust evaluation methodologies for long‑form, multi‑object video reasoning.