DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original paper on arXiv.

Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement learning algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) remains underexplored. In this paper, we explore GRPO and identify two issues that hinder effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function as a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as clipping and min operations. This directly aligns the model with the advantages, providing guidance to prefer better outputs. The difficulty-aware data augmentation strategy augments input prompts/videos to target solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.


💡 Research Summary

DeepVideo‑R1 addresses the persistent challenges of applying reinforcement‑learning‑based fine‑tuning to video large language models (VideoLLMs). While Group Relative Policy Optimization (GRPO) has shown promise for text‑only LLMs, its direct transfer to multimodal video‑text models suffers from two critical issues: (1) reliance on stabilizing mechanisms such as clipping and minimum‑value operations, which suppress gradients and slow convergence, and (2) the vanishing‑advantage problem, where samples that are too easy or too hard produce near‑zero advantage, eliminating useful learning signals.
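The vanishing-advantage effect described above can be seen directly in the group-normalized advantage that GRPO computes. The sketch below is illustrative, not the paper's code: when a group of rollouts for one prompt all receive the same reward (the sample is too easy or too hard), every standardized advantage collapses to zero and the sample contributes no learning signal.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as in GRPO: each rollout's reward
    is standardized against the mean and std of its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group with mixed rewards yields informative, non-zero advantages.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))

# A group where every rollout fails (too hard) or every rollout
# succeeds (too easy) collapses to zero advantage -> no signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))
```

The `eps` term is a standard numerical guard against division by zero; the exact stabilizer used in the paper is not specified in this summary.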
The authors propose two complementary innovations. First, Reg‑GRPO reformulates the GRPO objective as a regression problem. Instead of using a PPO‑style ratio loss with clipping, the model is trained to directly predict the group‑based advantage Â(i) for each video‑text pair i; the loss is a simple L2 regression between the predicted advantage and the advantage computed from the original GRPO formulation. Because clipping and min operations are removed, gradient flow is no longer artificially limited, yielding faster and more stable convergence. Moreover, aligning the model's output with the advantage value provides an explicit incentive to generate higher‑reward outputs.
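The regression objective can be sketched as follows. This is a minimal illustration under the summary's description, not the paper's implementation: how the predicted advantage is derived from the policy's log-probabilities is not specified here, so the function simply takes it as an input.

```python
import numpy as np

def reg_grpo_loss(predicted_adv, rewards, eps=1e-8):
    """Sketch of the Reg-GRPO objective: an L2 regression between the
    model's predicted advantage and the group-normalized advantage from
    the original GRPO formulation. No clipping or min operations appear,
    so the gradient of the loss is never truncated."""
    r = np.asarray(rewards, dtype=float)
    target_adv = (r - r.mean()) / (r.std() + eps)  # GRPO group advantage
    pred = np.asarray(predicted_adv, dtype=float)
    return np.mean((pred - target_adv) ** 2)       # plain L2 regression
```

For example, with group rewards `[1.0, 0.0]` the normalized targets are `[1, -1]`, so a prediction of `[1.0, -1.0]` gives (near-)zero loss, while a flat prediction of `[0.0, 0.0]` is penalized.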
Second, a difficulty‑aware data augmentation pipeline dynamically adjusts the hardness of training examples. Easy samples are perturbed by adding video noise, frame dropping, or color jitter, thereby raising their difficulty and preventing the model from over‑fitting trivial cases. Hard samples receive auxiliary reasoning cues such as textual hints or object‑level annotations, enriching the reward landscape and mitigating the vanishing‑advantage effect. This balanced augmentation yields a more uniform distribution of advantage values across the training set.
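The routing logic behind this pipeline can be sketched as below. The thresholds, operation names, and the use of the group success rate as the difficulty proxy are illustrative assumptions; the summary only states that easy samples get video perturbations and hard samples get auxiliary reasoning cues.

```python
import random

def route_augmentation(sample, group_success_rate,
                       easy_thresh=0.9, hard_thresh=0.1):
    """Hypothetical difficulty-aware routing: samples the policy solves
    almost always are made harder via video perturbations, and samples
    it almost never solves receive an auxiliary reasoning cue.
    Thresholds and helper names are illustrative, not from the paper."""
    if group_success_rate >= easy_thresh:
        # Too easy: perturb the video to raise difficulty.
        op = random.choice(["add_noise", "drop_frames", "color_jitter"])
        return {**sample, "video_aug": op}
    if group_success_rate <= hard_thresh:
        # Too hard: attach a textual hint or object-level annotation.
        return {**sample, "hint": "textual_hint"}
    return sample  # already at a solvable difficulty: leave untouched
```

Keeping most samples in the solvable band is what produces the more uniform advantage distribution described above.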
Experiments are conducted on three challenging benchmarks: SEED‑Bench‑R1 (video question answering), LongVideoBench (long‑duration reasoning), and NExTGQA (complex multi‑step video QA). DeepVideo‑R1 is built on the same Qwen2.5‑VL backbone (both the 3B‑ and 7B‑parameter versions) to ensure a fair comparison. Results show substantial gains: +10.1 percentage points on SEED‑Bench‑R1, +8.6 pp on LongVideoBench, and +5.4 pp on NExTGQA, outperforming the original GRPO, DPO, and standard supervised fine‑tuning baselines. Convergence requires roughly 30% fewer epochs, and gradient‑clipping events disappear entirely. The model also demonstrates improved out‑of‑distribution robustness, narrowing the performance gap between in‑distribution and OOD tasks.
Ablation studies confirm that both components are essential. Removing the regression formulation re‑introduces gradient clipping and slows learning, while omitting difficulty‑aware augmentation leads to a pronounced drop in advantage magnitude and a resurgence of the vanishing‑advantage problem.
The paper’s contributions are threefold: (1) a novel Reg‑GRPO scheme that eliminates heuristic stabilizers and directly optimizes advantage predictions; (2) a difficulty‑aware augmentation framework that diversifies reward signals and balances learning across difficulty levels; (3) the DeepVideo‑R1 model, which sets new state‑of‑the‑art results on multiple video reasoning benchmarks.
Limitations include the reliance on the quality of the original advantage estimator and the relatively coarse granularity of difficulty adjustments (currently applied at the video‑text pair level). Future work may explore finer‑grained group definitions, frame‑level advantage regression, and the integration of human feedback to construct hybrid reward models. Overall, DeepVideo‑R1 demonstrates that recasting policy‑gradient objectives as regression tasks, combined with targeted data augmentation, can substantially enhance the reasoning capabilities of multimodal large language models.

