GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring paradigm across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm suffers from stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable, human-aligned evaluation. Crucially, our experiments uncover a striking finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
💡 Research Summary
The paper addresses a pressing problem in the evaluation of modern visual generation models, which have rapidly progressed from simple text‑to‑image synthesis to complex tasks such as image editing, composition, and text‑to‑video generation. Traditional automatic metrics (e.g., FID, CLIP‑Score) fail to capture fine‑grained semantic alignment and aesthetic quality, while human evaluation, though reliable, is costly and unscalable. Recent approaches have turned to Vision‑Language Models (VLMs) as surrogate judges, but they largely retain an absolute pointwise scoring paradigm.
Through systematic experiments across three benchmark suites—GenAI‑Bench (image generation), EditScore‑Bench (image editing), and VideoGen‑RewardBench (video generation)—the authors demonstrate two critical flaws of pointwise scoring: (1) self‑consistency collapse, where repeated evaluations of the same image‑prompt pair produce wildly different scalar scores, leading to unstable rankings; and (2) poor correlation with human preferences (Spearman ρ ≈ 0.36). The instability is traced to the cognitive difficulty of maintaining a consistent absolute grading rubric, a problem that is amplified when VLMs generate stochastic outputs.
To overcome these issues, the authors propose GenArena, a unified evaluation framework that replaces absolute scoring with a pairwise comparison protocol. Instead of asking a VLM to assign a numeric quality score, the system presents two generated images side by side for the same prompt and asks which one satisfies it better. This binary decision mirrors how humans naturally compare outputs and dramatically reduces variance; the authors report a consistency score of 0.94 for pairwise judgments versus 0.61 for pointwise scores on repeated trials.
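The battle step can be sketched as follows. This is a hypothetical harness, not the paper's implementation: the `judge` callable stands in for a VLM that returns "a", "b", or "tie" for a (prompt, image_a, image_b) query, and the order-swapping step is a common precaution against position bias in pairwise LLM/VLM judging rather than a detail taken from the paper.

```python
# Hypothetical pairwise-battle harness. `judge` is an assumed stand-in for
# a VLM judge; it is queried with both image orderings, and the two verdicts
# must agree (after relabeling) for a win to be recorded.

def pairwise_battle(judge, prompt, image_a, image_b):
    first = judge(prompt, image_a, image_b)   # A shown first
    second = judge(prompt, image_b, image_a)  # B shown first
    # Map the swapped-order verdict back to the original A/B labels.
    relabeled = {"a": "b", "b": "a", "tie": "tie"}[second]
    if first == relabeled:
        return first   # both orderings agree on the winner
    return "tie"       # disagreement is treated as a tie
```

A judge that always answers "a" regardless of content is neutralized by this scheme: its two verdicts contradict each other after relabeling, so the battle is scored as a tie.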
The pairwise outcomes are then aggregated using the Elo rating system, a well‑established method from competitive gaming that converts win‑loss records into stable skill scores. By feeding millions of pairwise judgments into Elo, GenArena constructs a dynamic leaderboard that reflects both overall performance and task‑specific strengths (basic editing, reasoning‑intensive editing, and multi‑reference composition).
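The Elo update itself is compact. A minimal sketch in Python, where the K-factor of 32 and the initial rating of 1000 are illustrative defaults rather than values reported by the paper:

```python
# Standard Elo update over a stream of battle records (model_a, model_b,
# winner), with winner in {"a", "b", "tie"}. K=32 and init=1000.0 are
# illustrative defaults, not values from the paper.

def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that the player rated r_a beats r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, winner, k=32, init=1000.0):
    r_a = ratings.get(model_a, init)
    r_b = ratings.get(model_b, init)
    s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]  # actual score for A
    e_a = expected_score(r_a, r_b)                   # expected score for A
    ratings[model_a] = r_a + k * (s_a - e_a)
    ratings[model_b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return ratings

battles = [("ModelX", "ModelY", "a"),
           ("ModelY", "ModelX", "a"),
           ("ModelX", "ModelY", "a")]
ratings = {}
for a, b, w in battles:
    update_elo(ratings, a, b, w)
# ModelX won 2 of 3 battles, so it ends with the higher rating.
```

Because each update moves ratings in proportion to how surprising the outcome was, the leaderboard converges to stable relative skill scores even when individual judgments are noisy.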
Empirical results are striking. When evaluated with the pairwise protocol, off‑the‑shelf open‑source VLMs (e.g., Qwen‑3‑VL‑8B, GLM‑4.6V‑Flash, InternVL‑3.5‑8B) achieve a +20% boost in binary classification accuracy over their pointwise counterparts, often surpassing proprietary models such as GPT‑5 and Gemini‑2.5 Pro. For example, Qwen‑3‑VL‑8B improves from 49.1% (pointwise) to 60.5% (pairwise) on GenAI‑Bench, while the best proprietary model reaches only 75.5% with pointwise scoring. Moreover, the Elo rankings derived from GenArena exhibit a Spearman correlation of 0.86 with the human‑derived LMArena leaderboard, a dramatic improvement over the 0.36 correlation of pointwise methods.
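Spearman correlation, the agreement measure cited above, is simply Pearson correlation computed over ranks, so it rewards getting the ordering of models right rather than their exact scores. A self-contained sketch (the leaderboard numbers at the bottom are invented for illustration, not taken from the paper):

```python
# Spearman rank correlation in pure Python: rank both score lists
# (averaging ranks over ties), then take the Pearson correlation of
# the rank vectors. The example leaderboards below are made up.

def rank(values):
    """Return 1-based ranks, averaging the rank over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var

# Two leaderboards that order the models identically give rho = 1.0.
print(spearman([1120, 1085, 1040, 990], [1300, 1240, 1150, 1100]))  # -> 1.0
```

A rho of 0.86 thus means the GenArena Elo ordering nearly reproduces the human-derived LMArena ordering, while 0.36 indicates only weak agreement.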
A key insight is that open‑source VLMs do not require any fine‑tuning on costly human‑preference datasets to become effective judges; the mere switch to a relative comparison format unlocks their latent discriminative power. This challenges the prevailing belief that high‑quality evaluation necessitates large‑scale supervised alignment.
GenArena is released as a fully open‑source suite, including the prompt set (6,086 diverse user prompts), the pairwise battle pipeline, the Elo aggregation code, and an online leaderboard. Researchers can extend the benchmark by adding new prompts or tasks, and the system will automatically update Elo scores, ensuring reproducibility and continual relevance as visual generation models evolve.
In summary, the paper makes three major contributions: (1) a rigorous diagnosis of the shortcomings of absolute pointwise scoring for visual generation evaluation; (2) the design and validation of a pairwise‑Elo framework (GenArena) that delivers human‑aligned, consistent, and discriminative assessments; and (3) empirical evidence that open‑source VLMs, without any additional training, can outperform proprietary judges when evaluated under the pairwise paradigm. This work proposes a paradigm shift—from scoring to comparing—that is likely to become the new standard for evaluating visual generative AI.