VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models
Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which in turn makes their evaluation important. However, video aesthetic quality assessment, a fundamental human ability, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several notable characteristics: (1) Diverse content: 1,804 videos drawn from multiple sources, including user-generated (UGC), AI-generated (AIGC), compressed, robot-generated (RGC), and game videos. (2) Multiple question formats: traditional single-choice questions, multiple-choice questions, True/False questions, and a novel open-ended format for video aesthetics description. (3) Holistic aesthetic dimensions: visual form questions covering 5 aspects, visual style questions covering 4 aspects, and visual affectiveness questions covering 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability; their performance remains incomplete and imprecise. We hope VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment. The data will be released at https://github.com/michaelliyunhao/VideoAesBench
💡 Research Summary
The paper introduces VideoAesBench, the first comprehensive benchmark specifically designed to evaluate large multimodal models (LMMs) on video aesthetic perception. Recognizing that existing LMM benchmarks focus largely on high‑level semantic tasks such as action recognition or spatial understanding, the authors argue that the ability to assess aesthetic quality—a fundamental human capability—remains largely untested for these models.
To construct the benchmark, the authors aggregate raw videos from ten publicly available datasets covering five distinct content categories: user-generated content (UGC), AI-generated content (AIGC), robot-generated content (RGC), compressed videos, and gaming videos. After balanced sampling, the final pool consists of 1,804 videos spanning a wide range of resolutions, durations, and visual characteristics. Each video is paired with a carefully crafted question-answer (Q-A) item, resulting in 1,804 video-question-answer triples.
The question design is a central contribution. Four question formats are employed: (1) True/False (binary judgment on an aesthetic statement), (2) Single‑Choice (one correct answer among four options), (3) Multiple‑Choice (more than one correct answer among four options), and (4) Open‑Ended (free‑form description without predefined choices). The inclusion of Multiple‑Choice and Open‑Ended items deliberately raises the difficulty level, probing whether models can capture all relevant aesthetic cues and generate coherent, explainable narratives.
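To make the item structure concrete, a single benchmark entry might be represented roughly as follows; the field names and example values are purely illustrative assumptions, not the released data schema.

```python
# Hypothetical record for one VideoAesBench item (illustrative only; not the official schema).
example_item = {
    "video_id": "ugc_000123",               # hypothetical identifier
    "source_type": "UGC",                   # UGC / AIGC / RGC / compressed / game
    "question_format": "multiple_choice",   # true_false / single_choice / multiple_choice / open_ended
    "dimension": "Color",                   # one of the 12 fine-grained sub-dimensions
    "question": "Which of the following best describe the color design of this video?",
    "options": [
        "A. Warm, saturated palette",
        "B. Monochrome",
        "C. High contrast between subject and background",
        "D. Pastel, low-saturation tones",
    ],
    "answer": ["A", "C"],                   # multiple-choice items can have more than one correct option
}
```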
Aesthetic dimensions are organized into three high‑level aspects—Visual Form, Visual Style, and Visual Affectiveness—further divided into twelve fine‑grained sub‑dimensions: Visual Composition, Visual Elements & Structure, Shot Size, Depth of Field, Visual Subject, Lighting, Color, Visual Tone, Creativity, Emotion, Theme & Communication, and Viewer Interest. This taxonomy enables a nuanced assessment of whether a model understands composition, color harmony, narrative relevance, emotional resonance, and other subtle qualities that humans routinely evaluate.
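For reference, the taxonomy can be written down compactly. The grouping below follows the 5/4/3 split described in the abstract and the order in which the sub-dimensions are listed; the Python dict itself is just an illustration, not an artifact from the paper.

```python
# VideoAesBench aesthetic taxonomy: 3 high-level aspects -> 12 fine-grained sub-dimensions.
AESTHETIC_TAXONOMY = {
    "Visual Form": [
        "Visual Composition",
        "Visual Elements & Structure",
        "Shot Size",
        "Depth of Field",
        "Visual Subject",
    ],
    "Visual Style": [
        "Lighting",
        "Color",
        "Visual Tone",
        "Creativity",
    ],
    "Visual Affectiveness": [
        "Emotion",
        "Theme & Communication",
        "Viewer Interest",
    ],
}
```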
For evaluation, 23 LMMs are benchmarked, including 18 open-source models (e.g., LLaVA-Next, Qwen-VL, MiniGPT-4 variants) and 5 closed-source commercial models (e.g., GPT-4V, Gemini-Pro Vision). Performance is measured using accuracy and F1 for closed-ended questions and BLEU/ROUGE for open-ended responses; a minimal scoring sketch for the closed-ended formats is given after the list of findings below. The results reveal several key patterns:
- Closed-source superiority with exceptions – Commercial models generally outperform open-source counterparts; notably, even Qwen-3-VL, an open-source model, lags behind many closed-source systems, indicating that model size alone does not guarantee aesthetic competence.
- Question-type difficulty gradient – Single-Choice and True/False items achieve relatively high accuracies (≈70% ± 5%), whereas Multiple-Choice and Open-Ended questions drop steeply to below 45%, highlighting a gap in models' ability to reason about multiple concurrent aesthetic factors and to articulate nuanced explanations.
- Aspect-wise performance imbalance – Models perform best on Visual Form sub-dimensions (composition, shot size) while struggling with Visual Style (lighting, color harmony) and Visual Affectiveness (emotion, viewer interest), suggesting that current LMMs are more attuned to structural cues than to affective or stylistic subtleties.
- Content-type sensitivity – All models exhibit notable performance degradation on compressed videos and robot-generated footage, implying that distortion artifacts or unconventional camera motion impede aesthetic reasoning.
- Error analysis – Common failure modes include misidentifying dominant colors, overlooking depth-of-field cues, and providing generic or contradictory emotional descriptions in open-ended responses.
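As a rough sketch of how the closed-ended portion of such an evaluation could be scored, the snippet below computes plain accuracy for True/False and Single-Choice items and a per-item F1 over option sets for Multiple-Choice items. The matching rules and function names are assumptions for illustration, not the authors' released evaluation code.

```python
from typing import List, Set

def choice_accuracy(preds: List[str], golds: List[str]) -> float:
    """Accuracy for True/False and Single-Choice items (one option letter per answer)."""
    if not golds:
        return 0.0
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(preds, golds))
    return correct / len(golds)

def multi_choice_f1(pred_sets: List[Set[str]], gold_sets: List[Set[str]]) -> float:
    """Mean per-item F1 between predicted and gold option sets for Multiple-Choice items."""
    scores = []
    for pred, gold in zip(pred_sets, gold_sets):
        if not pred and not gold:
            scores.append(1.0)
            continue
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage with made-up predictions
print(choice_accuracy(["A", "C", "B"], ["A", "B", "B"]))   # -> 0.666...
print(multi_choice_f1([{"A", "C"}], [{"A", "B", "C"}]))    # -> 0.8
```

Open-ended responses would be handled separately (e.g., BLEU/ROUGE against reference descriptions), which is why the accuracy figures quoted above apply most directly to the choice-based formats.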
The authors conclude that while LMMs possess a rudimentary capacity for video aesthetic judgment, they fall short of the fine‑grained, multi‑dimensional understanding exhibited by humans. They advocate for future research directions: expanding diverse video sources, enriching fine‑grained aesthetic annotations, developing specialized prompting or fine‑tuning strategies for each sub‑dimension, and integrating explainable attention mechanisms that can surface the visual evidence behind a model’s aesthetic claim.
VideoAesBench itself is released publicly (GitHub link provided) and is positioned as a robust testbed for the community. By offering a standardized, richly annotated benchmark that spans multiple question formats and aesthetic facets, the work paves the way for more explainable, human‑aligned video assessment systems and sets a clear benchmark for measuring progress in this emerging research area.