CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos
Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex, implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail on video inputs with complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges via temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses for each object a corresponding keyframe in which it can be observed effortlessly among all frames (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and can also be applied to Reasoning Video Instance Segmentation. This training-free property further allows the framework to be extended to online video streams, where CoT is used at test time to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with both explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
💡 Research Summary
CoT‑RVS introduces a novel, training‑free framework for Reasoning Video Object Segmentation (Reasoning VOS) that leverages the zero‑shot Chain‑of‑Thought (CoT) capability of pre‑trained multimodal large language models (MLLMs). The core challenge in Reasoning VOS is to generate a mask sequence for a video given a complex, often implicit textual query that may be temporally sensitive (e.g., “Which player makes a three‑point shot?”). Existing methods fine‑tune MLLMs for segmentation but struggle with temporal reasoning and require costly training pipelines.
CoT‑RVS addresses these issues through a three‑stage pipeline: (1) Keyframe Selection via CoT – a set of candidate frames is uniformly sampled from the video. An MLLM agent (e.g., GPT‑4o or Gemma‑3) receives the query and each candidate frame, then engages in a self‑generated question‑answer sequence. The questions progress from generic visual description (“What is visible in this frame?”) to temporal relevance (“Is this a better keyframe for the query?”) and finally to query satisfaction (“Does any object here meet the query?”). By reasoning across frames, the model outputs a list of target instances, each paired with a selected keyframe and a concise object description. This step requires no fine‑tuning and works with closed‑source models.
(2) Reasoning Image Segmentation – for each selected keyframe, a conventional image segmentation model is prompted with the object description to produce a high‑quality mask (the “key mask”).
(3) Video Processor / Tracker – the key masks are propagated across the entire video using a tracking module (optical‑flow‑based or transformer‑based), yielding temporally consistent instance‑level mask sequences.
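The three stages above can be sketched as a minimal, training-free pipeline. This is a simplified illustration, not the paper's actual implementation: `ask_mllm`, `segment`, and `track` are hypothetical stand-ins for the MLLM agent's multi-step QA chain, the reasoning image segmentation model, and the tracking module, respectively.

```python
from dataclasses import dataclass


@dataclass
class Target:
    description: str  # concise object description produced by the CoT step
    keyframe: int     # index of the frame where the object is most observable


# Stage 1: CoT keyframe selection. In the paper this is a self-generated
# QA chain (describe -> temporal relevance -> query satisfaction); here
# ask_mllm collapses the chain into one call that returns an object
# description when the frame satisfies the query, else None.
def cot_select_targets(frames, query, ask_mllm):
    targets = []
    for idx, frame in enumerate(frames):
        answer = ask_mllm(frame, query)
        if answer is not None:
            targets.append(Target(description=answer, keyframe=idx))
    return targets


# Stage 2: prompt an image segmentation model with each object
# description on its keyframe to obtain the "key mask".
def segment_keyframes(frames, targets, segment):
    return {t.description: segment(frames[t.keyframe], t.description)
            for t in targets}


# Stage 3: propagate every key mask across the whole video with a
# tracking module, yielding one mask sequence per target instance.
def propagate(frames, key_masks, track):
    return {desc: track(frames, mask) for desc, mask in key_masks.items()}
```

With toy stubs standing in for the three models, the pipeline can be exercised end to end; real deployments would plug in an MLLM client, a promptable segmenter, and a video tracker behind the same three callables.

```python
frames = ["f0", "f1", "f2"]
ask = lambda frame, q: "player" if frame == "f1" else None
seg = lambda frame, desc: f"mask({desc}@{frame})"
trk = lambda frs, mask: [mask] * len(frs)

targets = cot_select_targets(frames, "who scores?", ask)
masks = propagate(frames, segment_keyframes(frames, targets, seg), trk)
# masks["player"] is a mask sequence covering all three frames
```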
A distinctive extension handles online streaming video: as new frames arrive, the CoT module re‑evaluates the current keyframe set, potentially replacing a keyframe if a better‑matching object emerges, and the tracker updates accordingly. This dynamic re‑selection enables continuous adaptation, a capability rarely explored in prior Reasoning VOS systems.
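The online extension can be sketched as a stateful per-frame loop. The interface below is an illustrative assumption: `score_fn` stands in for a CoT-derived match score between the incoming frame and the query, and `segment_fn` for the image segmenter; a keyframe (and its key mask) is replaced only when a better-matching target emerges.

```python
class OnlineCoTRVS:
    """Minimal sketch of test-time keyframe re-selection on a stream.

    score_fn(frame, query) -> float: hypothetical CoT-derived match score.
    segment_fn(frame, query) -> mask: stand-in for the image segmenter.
    """

    def __init__(self, query, score_fn, segment_fn):
        self.query = query
        self.score = score_fn
        self.segment = segment_fn
        self.best_score = float("-inf")
        self.key_mask = None

    def step(self, frame):
        s = self.score(frame, self.query)
        if s > self.best_score:
            # A better-matching target became visible: re-select the
            # keyframe and recompute the key mask.
            self.best_score = s
            self.key_mask = self.segment(frame, self.query)
        # The tracker would propagate the current key mask onward from here.
        return self.key_mask
```

In a full system, each `step` would also feed the (possibly updated) key mask to the tracker so the output mask sequence stays temporally consistent across the replacement.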
The authors evaluate CoT‑RVS on both referring VOS benchmarks (MeViS, Ref‑DAVIS‑17) and reasoning VOS datasets (ReVOS, ReasonVOS). Across explicit and implicit queries, CoT‑RVS outperforms state‑of‑the‑art baselines by a substantial margin (average gains of 7–12 percentage points in J&F and mIoU). Qualitative visualizations show that the framework correctly identifies temporally localized actions (e.g., a fast‑moving basketball player) and selects keyframes where the target is most observable, even under occlusion or rapid motion.
Key contributions are: (1) the first application of zero‑shot CoT reasoning to temporal‑semantic correlation in video segmentation, (2) a fully training‑free, modular architecture compatible with any pre‑trained MLLM, and (3) an online adaptation mechanism that updates keyframes during inference. The work opens avenues for further research on more efficient frame sampling, multi‑object simultaneous CoT reasoning, and lightweight CoT‑enabled models for real‑time deployment.