TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Reading time: 5 minutes
...

📝 Original Info

  • Title: TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
  • ArXiv ID: 2512.16270
  • Date: 2025-12-18
  • Authors: Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, Alex Jinpeng Wang

📝 Abstract

Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.


📄 Full Content

TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Rui Gui1,*, Yang Wan1,*, Haochen Han2,*, Dongxing Mao1, Fangming Liu2, Min Li1, Alex Jinpeng Wang1,†
1Central South University, 2Pengcheng Laboratory. *Equal contribution. †Corresponding author.

[Figure 1] Overview of TextEditBench. TextEditBench covers diverse text-in-image editing types such as translation, replacement, color and rotation adjustment, text removal, scaling, and creation, spanning both visual fidelity and reasoning-intensive semantic edits. The figure panels illustrate example instructions (e.g., "Replace 1200 to 1100", "Turn black text into orange", "Correct the spelling of the title 'Thought on Data Sience'") for the edit types Translate, Replace, Correct, Font, Color, Style, Rotate, Swap, Shift, Text Delete, Text Insert, and Text Scaling, grouped under the categories Text Change, Text Attribute, Text Relocation, and Others. For clarity, the major regions of modification are highlighted with red rectangles.

1. Introduction

The ever-increasing volume of visual content has driven progress in text-guided image editing, which aims to modify images flexibly through natural language instructions. Recent advances in diffusion models [16] and large vision-language models [38] have even demonstrated generation quality that human observers struggle to distinguish from real content.

However, much of their success is confined to editing intuitive visual objects. When the target shifts to text embedded in the image, existing generative models exhibit notable quality degradation: textual content possesses high semantic density and is tightly coupled with layout, typography, perspective, and scene context. This severely hinders their application in real-world scenarios such as advertising customization and watermark manipulation.
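Before going further, it helps to make the Figure 1 taxonomy concrete. The paper text here does not define a data schema, so the following is a minimal sketch under the assumption that each benchmark case pairs an image, an instruction, an edit-type label, and a target region; every name in it (TextEditTask, EditType, requires_reasoning) is hypothetical, not the authors' actual format.

```python
from dataclasses import dataclass
from enum import Enum

class EditType(Enum):
    """Edit categories mirroring the Figure 1 taxonomy (names assumed)."""
    TRANSLATE = "translate"
    REPLACE = "replace"
    CORRECT = "correct"
    FONT = "font"
    COLOR = "color"
    STYLE = "style"
    ROTATE = "rotate"
    SWAP = "swap"
    SHIFT = "shift"
    DELETE = "text_delete"
    INSERT = "text_insert"
    SCALE = "text_scaling"

@dataclass
class TextEditTask:
    """One hypothetical TextEditBench instance (schema is illustrative)."""
    image_path: str                     # source image containing the text region
    instruction: str                    # natural-language edit instruction
    edit_type: EditType                 # which taxonomy bucket the edit falls into
    region: tuple[int, int, int, int]   # (x, y, w, h) of the edited region
    requires_reasoning: bool            # True for reasoning-intensive semantic edits

# Example instance, using one of the example instructions from Figure 1.
task = TextEditTask(
    image_path="samples/menu.png",
    instruction="Replace 1200 to 1100",
    edit_type=EditType.REPLACE,
    region=(320, 96, 180, 40),
    requires_reasoning=True,  # the new price must stay consistent with the scene
)
print(task.edit_type.value)
```

A flat record like this would be enough to stratify results by edit type and to separate reasoning-intensive cases from purely visual ones, which is the split the benchmark emphasizes.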
Driven by the demand for precise text editing, text rendering [2, 19, 20, 22, 30, 31, 45] has become a key metric for evaluating generative foundation models; for example, half of the GPT-4o image generation demonstrations focus on text rendering scenarios. To systematically assess this capability, a series of benchmarks [9, 28, 30, 33, 50] has been proposed to measure controllability at the visual text level. However, existing benchmarks primarily focus on basic pixel-level manipulations and overlook higher-level semantic challenges. Specifically, through empirical analysis of advanced models [4, 5, 14, 27, 39], we observe three common failure modes:

(i) Text rendering and legibility. Models may hallucinate characters, misspell words, or distort glyphs, deviating from the specified content.
(ii) Visual and stylistic consistency. Inserted or replaced text often mismatches the background in font, color, perspective, or illumination, leading to "copy–paste" artifacts.
(iii) Semantic consistency. Editing prices, labels, or chart values can break logical or factual relations in the scene, revealing limited reasoning ability.

There is therefore an urgent need for a benchmark that requires models to reason over contextual and logical relations among textual elements to preserve global coherence. In this work, we propose TextEditBench to unify the evaluation of editing textual content within images, ...
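The excerpt names the Semantic Expectation (SE) dimension but does not specify how it is scored. As a minimal sketch, assuming an external vision-language-model judge is prompted to rate the three failure-mode axes above, an evaluation harness might look like the following; the rubric wording, the score_edit function, the call_vlm_judge stub, and the weights are all assumptions, not the paper's protocol.

```python
import json

RUBRIC = """You are judging a text edit inside an image.
Rate each axis from 0 (fails) to 5 (perfect) and reply as JSON:
1. rendering: are the edited characters legible and correctly spelled?
2. style: do font, color, perspective, and lighting match the scene?
3. semantic: does the edit preserve logical/factual relations
   (e.g., prices still sum correctly, labels still match the chart)?
"""

def call_vlm_judge(source_image: str, edited_image: str, prompt: str) -> str:
    """Stub for a VLM judge call; wire this to a real multimodal API."""
    raise NotImplementedError("plug in your judge model here")

def score_edit(source_image: str, edited_image: str, instruction: str) -> dict:
    """Score one edit along the three failure-mode axes (hypothetical)."""
    reply = call_vlm_judge(source_image, edited_image,
                           f"{RUBRIC}\nInstruction: {instruction}")
    scores = json.loads(reply)  # expects {"rendering": .., "style": .., "semantic": ..}
    # An SE-style score would weight the reasoning axis most heavily;
    # the 0.5/0.25/0.25 weights below are illustrative, not from the paper.
    scores["overall"] = (0.5 * scores["semantic"]
                         + 0.25 * scores["rendering"]
                         + 0.25 * scores["style"])
    return scores
```

A harness like this would make the distinction between pixel-level fidelity (axes 1 and 2) and reasoning-aware consistency (axis 3) directly measurable per benchmark instance.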

Reference

This content is AI-processed based on open-access arXiv data.
