📝 Original Info
- Title: TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
- ArXiv ID: 2512.16270
- Date: 2025-12-18
- Authors: Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, Alex Jinpeng Wang
📝 Abstract
Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
📄 Full Content
TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
Rui Gui¹*, Yang Wan¹*, Haochen Han²*, Dongxing Mao¹, Fangming Liu², Min Li¹, Alex Jinpeng Wang¹†
¹Central South University   ²Pengcheng Laboratory
*Equal contribution. †Corresponding author.
[Figure 1 shows one example instruction per edit type: Translate ("Translate all the titles and subtitles from English to Chinese"), Replace ("Replace 1200 with 1100"), Correct ("Correct the spelling of the title 'Thought on Data Sience'"), Font ("Change the font of 'Dear Wear For Students' to imitate Song-dynasty style"), Color ("Turn black text into orange"), Style ("Make the text 'The website is under maintenance' bold"), Rotate ("Rotate 'Off-the-Beaten-Path Destinations' 30 degrees to the left"), Swap ("Swap the positions of '(lemon)' and 'Refreshing'"), Shift ("Move the time in the picture down"), Text Delete ("Delete the time"), Text Insert, and Text Scaling ("Expand the size of '邀请'"), grouped under the categories Text Change, Text Attribute, Text Relocation, and Others.]
Figure 1. Overview of TextEditBench. TextEditBench covers diverse text-in-image editing types such as translation, replacement, color and rotation adjustment, text removal, scaling, and creation, spanning both visual fidelity and reasoning-intensive semantic edits. For clarity, we highlight the major regions of modification with red rectangles.
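To make the task structure concrete, one plausible way to encode a TextEditBench-style example is sketched below. The schema is our own illustration under assumptions: the field names, the `EditCategory` grouping, and the `TextEditExample` class are hypothetical, not the benchmark's published data format.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical taxonomy mirroring the edit types shown in Figure 1;
# the benchmark's actual category names and grouping may differ.
class EditCategory(Enum):
    TEXT_CHANGE = "text_change"          # translate, replace, correct, delete, insert
    TEXT_ATTRIBUTE = "text_attribute"    # font, color, style, scaling
    TEXT_RELOCATION = "text_relocation"  # rotate, swap, shift
    OTHERS = "others"

@dataclass
class TextEditExample:
    """One benchmark item: a source image, an edit instruction, and
    the text region the edit should touch (all fields illustrative)."""
    image_path: str          # source image containing the target text
    instruction: str         # natural-language edit instruction
    category: EditCategory   # coarse edit type from the Figure 1 taxonomy
    edit_type: str           # fine-grained type, e.g. "translate", "rotate"
    target_region: tuple[int, int, int, int]  # (x, y, w, h) of the edited text

example = TextEditExample(
    image_path="poster.png",
    instruction="Replace 1200 with 1100",
    category=EditCategory.TEXT_CHANGE,
    edit_type="replace",
    target_region=(120, 48, 96, 24),
)
```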
1. Introduction
The ever-increasing volume of visual content has driven progress in text-guided image editing, which aims to modify images flexibly through natural language instructions. Recent advances in diffusion models [16] and large vision-language models [38] have even shown generation capabilities that produce results indistinguishable from real images to human observers.
However, much of their success is confined to editing intuitive visual objects. When the target shifts to text embedded in the image, existing generative models exhibit notable quality degradation, as textual content possesses high semantic density and is tightly coupled with layout, typography, perspective, and scene context. This severely hinders their application in real-world scenarios, e.g., advertising customization and watermark manipulation.
Driven by the demand for precise text editing, text rendering [2, 19, 20, 22, 30, 31, 45] has become a key metric for evaluating generative foundation models; for example, half of the GPT-4o image generation demonstrations focus on text rendering scenarios. To systematically assess this capability, a series of benchmarks [9, 28, 30, 33, 50] has been proposed to measure controllability at the visual text level. However, existing benchmarks primarily focus on basic pixel-level manipulations and overlook higher-level semantic challenges. Specifically, through empirical analysis of advanced models [4, 5, 14, 27, 39], we observe three common failure modes: (i) Text rendering and legibility: models may hallucinate characters, misspell words, or distort glyphs, deviating from the specified content. (ii) Visual and stylistic consistency: inserted or replaced text often mismatches the background in font, color, perspective, or illumination, leading to "copy-paste" artifacts. (iii) Semantic consistency: editing prices, labels, or chart values can break logical or factual relations in the scene, revealing limited reasoning ability. There is therefore an urgent need for a benchmark that requires models to reason over contextual and logical relations among textual elements to preserve global coherence.
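As a sketch of how evaluation along these three failure axes could be operationalized, the snippet below scores an edited image with a vision-language-model judge using a per-axis rubric. This is a minimal sketch under stated assumptions: the rubric wording, the 1-5 scale, and the caller-supplied `query_vlm` helper are our own constructions, not the paper's Semantic Expectation protocol.

```python
import json

# Illustrative rubric axes matching the three failure modes discussed above.
RUBRIC_AXES = {
    "rendering": "Are the edited characters legible, correctly spelled, and undistorted?",
    "visual_consistency": "Do the font, color, perspective, and lighting of the edited text match the scene?",
    "semantic_consistency": "Does the edit preserve logical and factual relations (prices, labels, chart values)?",
}

def build_judge_prompt(instruction: str) -> str:
    """Compose a rubric prompt asking a VLM judge for 1-5 scores per axis, as JSON."""
    lines = [
        f"The image was edited with the instruction: {instruction!r}.",
        "Rate each criterion from 1 (fails) to 5 (flawless) and answer as a JSON object.",
    ]
    for axis, question in RUBRIC_AXES.items():
        lines.append(f"- {axis}: {question}")
    return "\n".join(lines)

def score_edit(instruction: str, edited_image_path: str, query_vlm) -> dict:
    """query_vlm(prompt, image_path) -> str is a hypothetical caller-supplied
    wrapper around any multimodal API; it must return a JSON object string
    mapping each axis name to an integer score in [1, 5]."""
    raw = query_vlm(build_judge_prompt(instruction), edited_image_path)
    scores = json.loads(raw)
    # Normalize each 1-5 rating to [0, 1] and average across axes.
    normalized = {axis: (scores[axis] - 1) / 4 for axis in RUBRIC_AXES}
    normalized["overall"] = sum(normalized.values()) / len(RUBRIC_AXES)
    return normalized
```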
In this work, we propose TextEditBench to unify the evaluation of editing textual content within images.