VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual-prompt guidance and object-aware grounding rewards that enforce object-identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, and that self-distillation enables the model to internalize this ability without requiring prompts at inference time.
💡 Research Summary
VisionCoach introduces a novel training‑time visual‑prompt‑guided reinforcement learning (RL) framework for video question answering (VQA) that significantly improves spatio‑temporal grounding while preserving a lightweight inference pipeline. The core problem addressed is that existing video reasoning models either rely heavily on language priors, producing hallucinated explanations, or depend on external perception tools (e.g., cropping, zoom‑in) during inference, which adds computational overhead and complexity. VisionCoach tackles this by using visual prompts only during training to “coach” the model toward better grounding, and then internalizing this capability through self‑distillation so that no prompts are needed at test time.
The system consists of two main components. First, the Visual Prompt Selector (VP‑SELECTOR) predicts the most effective visual prompt type (e.g., darkening, red circle overlay, bounding‑box highlight) conditioned on the input video and question. To train this selector, the authors construct a prompt‑candidate dataset using a proxy reasoner that evaluates the impact of each prompt on answer correctness. VP‑SELECTOR is then fine‑tuned with supervised learning to select the optimal prompt for any given input. Second, the Spatio‑Temporal Reasoner (ST‑REASONER) is a policy network πθ optimized with a policy‑gradient RL algorithm (GSPO). For each training example, an initial set of roll‑outs is performed; if the average reward falls below a predefined threshold, the example is marked as “hard.” For hard samples, VP‑SELECTOR supplies a visual prompt that is applied to key frames, producing a prompted version of the video. The prompted input is fed to the policy, which receives grounding‑aware rewards that combine standard answer correctness with two novel components: (1) object‑identity consistency, encouraging the model to maintain the same object ID across time, and (2) multi‑region bounding‑box IoU, encouraging high overlap between predicted and ground‑truth boxes across several objects. These rewards explicitly enforce alignment between the reasoning trajectory and the visual evidence.
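The object-aware grounding reward described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the additive weighting of the three terms, and the dictionary-of-boxes representation are all assumptions; the paper only specifies that answer correctness is combined with object-identity consistency and multi-region bounding-box IoU.

```python
# Hedged sketch of an object-aware grounding reward, assuming boxes are
# (x1, y1, x2, y2) tuples keyed by object id. Names and weights are
# illustrative, not the paper's actual API.

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(answer_correct, pred, gt, w_ans=1.0, w_id=0.5, w_iou=0.5):
    """Combine answer correctness with (1) object-identity consistency,
    the fraction of ground-truth object ids the model keeps referring to,
    and (2) mean IoU over the regions of those shared objects."""
    shared = set(pred) & set(gt)
    id_consistency = len(shared) / len(gt) if gt else 0.0
    mean_iou = (sum(box_iou(pred[k], gt[k]) for k in shared) / len(shared)
                if shared else 0.0)
    return w_ans * float(answer_correct) + w_id * id_consistency + w_iou * mean_iou
```

A correct answer with perfectly tracked, perfectly localized objects would score the maximum (here 2.0), while identity switches or drifting boxes reduce the reward even when the final answer is right, which is what ties the reasoning trajectory to the visual evidence.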
Self‑distillation bridges the gap between prompted and raw inputs. The prompted policy acts as a teacher, while the same policy processing the unprompted video acts as a student. A KL‑divergence loss forces the student to mimic the teacher’s action distribution, thereby transferring the grounding improvements induced by visual prompting into the model’s internal representations. After distillation, inference requires only a single forward pass on the raw video, without any external tool calls or visual prompts.
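The teacher-student distillation step can be sketched as a KL divergence between the two next-token distributions of the same policy. This is a simplified, pure-Python illustration under assumed conventions (per-token logits, KL(teacher ∥ student) direction); the actual loss operates over full action distributions with the teacher side detached from the gradient graph.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def self_distill_kl(student_logits, teacher_logits):
    """KL(teacher || student) for one token position. The teacher logits
    come from the prompted video and would be detached in practice, so
    gradients pull only the student (raw-video) pass toward the teacher."""
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
```

When the student already matches the teacher the loss vanishes, so once the grounding behavior is internalized the distillation term stops exerting pressure, and inference on the raw video alone suffices.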
Extensive experiments on six benchmarks—V‑STAR (spatio‑temporal reasoning), VideoMME, World‑Sense, VideoMMMU, PerceptionTest (general video understanding), and Charades‑STA (temporal grounding)—show that VisionCoach achieves state‑of‑the‑art performance. On V‑STAR, it improves mean answer matching (mAM) by +15.0% and mean localization grounding metric (mLGM) by +25.1% over the previous best, surpassing GPT‑4o and Qwen2.5‑VL‑7B. Across the other datasets, it consistently outperforms prior open‑source video‑LLMs by 2–5 percentage points. Ablation studies confirm that both the visual‑prompt selector and the object‑aware grounding rewards are essential; removing either component degrades performance noticeably. Prompt‑type analysis reveals that “darkening” and “red‑circle” prompts yield the highest gains, while an oracle that always picks the best prompt achieves the upper bound, highlighting the importance of adaptive prompting.
Key contributions of the paper are: (1) an input‑adaptive RL framework that uses visual prompting as training‑time guidance to improve grounding, while maintaining a prompt‑free, single‑pass inference; (2) a novel grounding reward that incorporates object identity consistency and multi‑box IoU; (3) a visual‑prompt selector trained via a proxy‑reasoner pipeline to predict appropriate prompts per example; and (4) a self‑distillation scheme that internalizes the benefits of prompting, eliminating the need for external perception tools at test time. VisionCoach demonstrates that targeted, training‑time perception control can endow video‑LLMs with reliable, grounded reasoning capabilities without sacrificing efficiency.