VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning
Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: *poor generalization to out-of-distribution (OOD) videos* and *limited explainability*, which restrict their applicability in real-world scenarios. To address these challenges, we propose **VQAThinker**, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a **bell-shaped regression reward** that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a **pairwise ranking reward** that guides the model to correctly determine the relative quality between video pairs; and (3) a **temporal consistency reward** that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
💡 Research Summary
Video Quality Assessment (VQA) is essential for evaluating perceptual degradation in video pipelines, yet current no‑reference (NR) models suffer from poor out‑of‑distribution (OOD) generalization and a lack of explainability. VQAThinker addresses both issues by integrating a large multimodal model (LMM) with reinforcement learning (RL) using Group Relative Policy Optimization (GRPO). Unlike PPO‑style algorithms that rely on a separately trained value (critic) model, GRPO estimates advantages by comparing groups of generated responses, which naturally mirrors the human process of first reasoning about visual distortions and then assigning a quality score.
The framework introduces three VQA‑specific reward functions: (1) a bell‑shaped regression reward that grows sharply as the prediction error shrinks and saturates near the ground‑truth MOS, enabling fine‑grained score learning; (2) a pairwise ranking reward that aligns the relative order of model predictions with the relative order of ground‑truth MOS values, thereby preserving quality ordering across video pairs; and (3) a temporal consistency reward that encourages the model to score the original video above temporally perturbed versions (frame duplication, shuffling, etc.), making it sensitive to temporal artifacts such as motion blur or frame drops.
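The three rewards can be sketched as simple scalar functions. The exact formulas are not reproduced in this summary, so the Gaussian shape, the `sigma` width, and the 0/1 ranking scheme below are illustrative assumptions, not the paper's definitions:

```python
import math

def regression_reward(pred: float, mos: float, sigma: float = 0.5) -> float:
    """Bell-shaped reward (Gaussian form assumed): rises quickly as the
    error shrinks, then flattens near the ground-truth MOS."""
    return math.exp(-((pred - mos) ** 2) / (2 * sigma ** 2))

def ranking_reward(pred_a: float, pred_b: float,
                   mos_a: float, mos_b: float) -> float:
    """1 if the predicted order of a video pair matches the MOS order."""
    return float((pred_a - pred_b) * (mos_a - mos_b) > 0)

def temporal_reward(pred_orig: float, pred_perturbed: float) -> float:
    """1 if the original video is scored above its temporally perturbed copy."""
    return float(pred_orig > pred_perturbed)
```

In practice the three terms would be combined (e.g. as a weighted sum) into a single scalar reward per generated response.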
Architecturally, VQAThinker uses an off‑the‑shelf LMM as the backbone. A frozen motion feature extractor provides temporal dynamics, which are projected into the language space by a motion projector. The model receives a video and a textual prompt, then outputs a reasoning trace wrapped in dedicated tags, followed by a final numeric quality score.
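Extracting the reasoning trace and the score from such a response is a small parsing step. The `<think>`/`<answer>` tag names below follow the common DeepSeek‑R1‑style convention and are an assumption; the tags actually used by VQAThinker may differ:

```python
import re

def parse_response(text: str):
    """Split an LMM response into (reasoning, score).

    Assumes <think>...</think> for the reasoning trace and
    <answer>...</answer> for the numeric score (tag names hypothetical).
    Returns score=None if no answer tag is found.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    score = float(answer.group(1).strip()) if answer else None
    return reasoning, score
```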
Training proceeds as follows: for each video in a batch, the current policy generates K candidate responses. Each response receives the three rewards, producing a reward vector. GRPO aggregates these rewards across the group to compute a policy gradient, updating the LMM parameters using only MOS supervision. No explicit quality‑description annotations are required, distinguishing VQAThinker from prior explainable VQA works that rely on large instruction datasets (e.g., Q‑Insight, OmniVQA‑Chat).
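The group‑relative step above reduces to standardizing the K scalar rewards within each group, so that responses better than the group average receive positive advantages. A minimal sketch (the `eps` stabilizer is an implementation detail, not from the paper):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: z-score each response's reward against
    the mean and std of its own group of K candidates."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = (sum((r - mean) ** 2 for r in rewards) / k) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

These advantages then weight the log‑probabilities of the corresponding responses in the policy‑gradient update, with no learned critic required.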
Experiments are conducted on the LSVQ training set and evaluated on both in‑domain benchmarks (LIVE‑VQC, KoNViD‑1k) and OOD datasets featuring unseen content and distortion types. VQAThinker consistently outperforms state‑of‑the‑art NR‑VQA methods (e.g., VMAF‑NR, FAST‑VQA) and recent LMM‑based approaches (Q‑Insight, VQ‑Insight) in terms of SRCC and PLCC, with particularly large gains on OOD data (3–5 percentage points). Additionally, the model is tested on distortion attribution and quality description tasks; despite being trained solely with MOS scores, it delivers competitive accuracy and generates natural‑language explanations that align with human judgments.
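The SRCC and PLCC metrics used above are the Spearman and Pearson correlations between predicted scores and ground‑truth MOS. A self‑contained pure‑Python sketch (rank ties are ignored for brevity; a production evaluation would use a library implementation):

```python
def pearson(x: list[float], y: list[float]) -> float:
    """PLCC: Pearson linear correlation between predictions and MOS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x: list[float], y: list[float]) -> float:
    """SRCC: Pearson correlation of the ranks (ties not handled)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```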
Key contributions are: (1) a reasoning‑driven VQA framework that jointly learns quality understanding and scoring using only score‑level supervision; (2) the design of three novel VQA‑specific rewards that enable fine‑grained regression, order preservation, and temporal distortion awareness; (3) empirical evidence of superior generalization and interpretability without extra annotation cost; and (4) a demonstration that reinforcement learning can effectively bridge the gap between language‑based reasoning and quantitative video quality assessment.
Future directions include lightweight model variants for real‑time deployment, online RL with user feedback, multimodal extensions incorporating audio and text, and interactive interfaces that leverage the generated reasoning traces for targeted video enhancement. VQAThinker thus establishes a new paradigm for building generalizable and explainable video quality assessors.