GEBench: Benchmarking Image Generation Models as GUI Environments

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.


💡 Research Summary

The paper introduces GEBench, a benchmark specifically designed to evaluate image generation models when they are used as interactive graphical user interface (GUI) environments. Existing benchmarks focus on general visual fidelity or continuous video transitions, which do not capture the discrete, action‑driven state changes that characterize GUIs. GEBench fills this gap by providing 700 carefully curated samples organized into five task categories: (1) Single‑step visual transition, (2) Multi‑step planning (five‑step trajectories), (3) Zero‑shot virtual GUI generation for fictional apps, (4) Rare trajectory synthesis for real‑world apps, and (5) Grounding‑based generation where the model must render a state change at a precise normalized coordinate.
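The grounding-based category above conditions generation on a target point expressed in normalized coordinates, i.e., (x, y) in [0, 1] so the location is resolution-independent. A minimal sketch of how such a point maps to pixel space and how a localization error might be measured is shown below; the function names and the Euclidean error measure are illustrative assumptions, not the paper's actual protocol.

```python
import math

def to_pixels(norm_x: float, norm_y: float, width: int, height: int) -> tuple[int, int]:
    """Map a normalized coordinate in [0, 1] to pixel space for a screenshot size."""
    return round(norm_x * width), round(norm_y * height)

def localization_error(pred, target, width, height) -> float:
    """Euclidean pixel distance between predicted and target points (illustrative)."""
    px, py = to_pixels(*pred, width, height)
    tx, ty = to_pixels(*target, width, height)
    return math.hypot(px - tx, py - ty)

# Example: a 1080x2400 phone screenshot, target at (0.50, 0.25),
# prediction drifting slightly along x.
err = localization_error((0.52, 0.25), (0.50, 0.25), 1080, 2400)
print(round(err, 1))  # 22.0 pixels
```

Because the coordinates are normalized, the same annotation transfers across screenshots of different resolutions; only the pixel-space error depends on the rendered size.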

To assess performance, the authors propose GE‑Score, a five‑dimensional metric that rates Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality on a 0‑5 scale and averages the five dimensions into a single holistic score. This multi‑dimensional design captures functional correctness, logical coherence, structural stability, realistic UI design, and perceptual quality (including text readability and icon clarity).
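The aggregation described above is a simple mean over the five dimension ratings. A minimal sketch, assuming each dimension has already been rated on the 0-5 scale (the dimension names follow the paper; the function name and the rescaling to 0-100 for reporting are illustrative assumptions):

```python
from statistics import mean

# Five GE-Score dimensions, each rated 0-5 (dimension names from the paper).
DIMENSIONS = ("goal_achievement", "interaction_logic",
              "content_consistency", "ui_plausibility", "visual_quality")

def ge_score(ratings: dict[str, float]) -> float:
    """Average the five 0-5 dimension ratings into one holistic GE-Score."""
    for dim in DIMENSIONS:
        if not 0.0 <= ratings[dim] <= 5.0:
            raise ValueError(f"{dim} must be in [0, 5]")
    return mean(ratings[dim] for dim in DIMENSIONS)

sample = {"goal_achievement": 4, "interaction_logic": 3,
          "content_consistency": 5, "ui_plausibility": 4,
          "visual_quality": 4}
print(ge_score(sample))  # 4.0 on the 0-5 scale, i.e., 80/100 when rescaled
```

Rescaling by a factor of 20 gives the 0-100 figures quoted in the results (e.g., a mean rating of 4.0 corresponds to 80/100).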

The benchmark is used to evaluate twelve state‑of‑the‑art models, including Google’s NanoBanana‑pro, OpenAI’s GPT‑image‑1.5/1.0, and Seedream variants. Results show that while models perform reasonably well on single‑step tasks (average scores around 80/100), they struggle dramatically on multi‑step planning and grounding tasks, often falling below 50/100. The main failure modes are misinterpretation of icons, poor rendering of Chinese text, and inaccurate spatial grounding, leading to layout drift and logical inconsistencies across steps.

The authors analyze why current architectures falter: they excel at local texture and color synthesis but lack robust understanding of global layout semantics and discrete interaction logic. They suggest three research directions to close the gap: (1) dedicated pre‑training on icon and text corpora, (2) conditional decoding mechanisms that incorporate precise coordinate information, and (3) enhanced sequence modeling (e.g., retrieval‑augmented Transformers) to maintain long‑term coherence.

By releasing the dataset, code, and evaluation protocol, the paper provides the community with a common yardstick for future work. GEBench thus establishes a necessary foundation for turning generative image models into reliable, high‑fidelity GUI simulators that can support autonomous agents, UI prototyping, and accessibility research.

