Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

Minh Dinh    SouYoung Jin
Dartmouth College
{Minh.T.Dinh.GR, SouYoung.Jin}@dartmouth.edu

Abstract

Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe¹, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision–language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.
1. Introduction

Large-scale image datasets underpin modern computer vision, yet they contain faces, ID badges, personal documents, or other signals that expose individuals to privacy risks. These risks are amplified by the tendency of deep networks to memorize training data [43, 50], which makes them vulnerable to sensitive information extraction through model inversion [31, 41] or membership attacks [15, 39]. As datasets grow larger and more heterogeneous, exhaustive manual auditing becomes infeasible, forming a barrier to sharing data and training models responsibly at scale.

¹ https://see-ai-lab.github.io/unsafe2safe/

Figure 1. Examples from Unsafe2Safe (U2S). For each case, the model converts an unsafe image into a privacy-preserving safe version. Examples demonstrate key capabilities that may appear simultaneously: (1) structure-preserving full-body anonymization, (2) demographic neutralization (race entropy ↑), and (3) obfuscation of non-human confidential details.

Existing anonymization methods offer only partial protection. Redaction techniques such as blurring, masking, and inpainting are typically preceded by a detector, which often overlooks privacy cues outside predefined regions, and the redaction itself introduces artifacts that impair downstream utility. The core challenge is that anonymization is not merely a removal problem: sensitive content must be rewritten so that identity cues are suppressed without destroying the semantic structure needed for learning. An effective solution must therefore combine strong privacy protection with semantic fidelity and broad applicability in open-domain images, without relying on expensive manual annotation. Anonymization thus requires fine-grained control over which pixels to edit and how to change them without compromising non-private semantics.
In this work, we introduce Unsafe2Safe (U2S), a scalable anonymization framework that uses multimodal reasoning and text-guided diffusion to produce privacy-safe yet utility-preserving versions of image datasets. Our key idea is to combine two complementary textual signals that jointly capture the semantics of the original scene while specifying how sensitive content should be altered. The overall process proceeds in two steps: first generating privacy-aware textual guidance, and then applying controlled image editing conditioned on this guidance.

First, a vision–language model (VLM) inspects each image using predefined privacy criteria and flags those containing sensitive content. For unsafe images, it produces two captions: a private caption describing the full scene and a public caption that removes identity-specific details while preserving non-sensitive semantics. As the public caption specifies the desired safe outcome but not the transformation itself, a large language model (LLM) further generates a structured edit instruction that proposes identity-neutral substitutions and defines the required modifications. Together, the public caption and edit instruction determine what should be preserved and what should be altered.

Conditioned on these textual signals, a text-guided diffusion editor performs targeted anonymization while maintaining visual coherence. The public caption acts as a semantic anchor that preserves global layout and object relationships, whereas the edit instruction guides geometry- and style-consistent rewriting of sensitive attributes. Unlike detector-based pipelines, Unsafe2Safe does not require segmentation masks, attribute labels, or predefined privacy taxonomies. Instead, it leverages modern diffusion editors such as InstructPix2Pix [7] and FlowEdit [24], which support multi-prompt conditioning to modify only the regions implied by the instructions.
This design yields privacy-neutral reconstructions, removing identity cues and other sensitive or unsafe attributes while retaining structural fidelity and avoiding the semantic drift often observed in full-image regeneration. Figure 1 illustrates how Unsafe2Safe selectively anonymizes sensitive regions while preserving recognizability and downstream task relevance.

Through extensive experiments on Caltech-101 [13], MIT Indoor-67 [37], MS-COCO [29], and OK-VQA [32], we show that Unsafe2Safe achieves strong anonymization while maintaining or even improving downstream task performance. Models trained on Unsafe2Safe data match the accuracy of models trained on raw images, and in some cases exceed them due to removal of spurious correlations. Unsafe2Safe also reduces face similarity and text similarity, increases demographic diversity, and avoids semantic drift. Compared to baseline anonymization techniques, our approach leverages multimodal reasoning, dual-caption conditioning, and targeted diffusion editing, which together provide finer control, higher fidelity, and better scalability.

Our main contributions are summarized as follows:
(1) We introduce Unsafe2Safe, a controllable privacy-preserving diffusion framework that combines VLM-guided privacy inspection, public caption generation, and LLM-derived edit instructions to remove identity-sensitive content while preserving task-relevant semantics and spatial layout.
(2) We develop a scalable anonymization pipeline and dataset construction process that enables safe proxy generation across diverse domains. We release privacy-safe generated datasets along with tools for anonymizing additional datasets, supporting reproducibility and broad community adoption.
(3) We propose a unified evaluation framework for anonymization, introducing four metric groups that jointly quantify image quality, leakage pathways, demographic diversity, and downstream utility.
(4) Through extensive experiments on Caltech-101, MIT Indoor-67, MS-COCO, and OK-VQA, we show that Unsafe2Safe preserves or improves downstream accuracy while substantially reducing face similarity, text similarity, and demographic predictability, outperforming existing anonymization baselines.

2. Related Work

2.1. Privacy-Preserving Data Generation

Our work follows a data-centric approach to privacy: instead of protecting models after training, we anonymize images beforehand so that sensitive information never enters model weights. Most anonymization pipelines rely on face or person detectors [18, 21, 22, 25] and then apply blurring, masking, pixelation [17, 33], or generative inpainting [22, 36]. Others detect risks via embedding similarity [23]. Yet, their reliance on closed-set detectors limits their applicability to open-domain imagery, where privacy-encoding signals extend far beyond faces.

VLMs and LLMs [3, 8, 16, 52] offer a promising alternative for zero-shot privacy inspection and natural-language reasoning about sensitive content [33, 46]. However, existing pipelines use such models only for detection, without leveraging their semantic understanding to guide how sensitive content should be modified. Unsafe2Safe fills this gap by coupling VLM-based privacy inspection with LLM-generated edit instructions and executing the transformation using text-guided diffusion editing, fostering structured, context-aware anonymization in open-set categories.

2.2. Privacy Evaluation Frameworks

Prior work measures privacy via re-identification attacks [11, 25] or identity/attribute classification degradation [9, 14], but treats privacy and utility as separate objectives.
Datasets that annotate both [6, 40] are limited in scale, often restricted to binary human-centric attributes, and predominantly video-based, where utility depends on temporal cues rather than fine-grained spatial semantics.

More comprehensive benchmarks [1, 33] combine privacy and utility considerations but rely on human-annotated privacy labels and automatically generated utility labels, an inversion of real needs: privacy can often be inferred from general knowledge, but utility annotations require domain expertise and are costly. Our evaluation approach reverses this imbalance. We use VLMs for scalable, zero-shot privacy judgments and use original downstream datasets to measure utility. The resulting framework jointly assesses: (1) semantic fidelity, (2) residual privacy leakage, (3) fairness via demographic diversity, and (4) downstream performance when training on anonymized data, aligning evaluation with real-world deployment constraints.

2.3. Controllable Image Editing

Anonymization with Diffusion. Recent diffusion-based anonymization focuses on identity-neutral face synthesis [22, 25], sometimes incorporating structural controls such as ControlNet [54] or ReferenceNet [20]. While these improve realism, they still rely heavily on masks, detectors, or handcrafted attributes, limiting their ability to handle arbitrary privacy cues. Mask-based redaction also introduces artifact boundaries and harms downstream utility.

Diffusion Editing. Advances in diffusion editing [7, 30, 53, 57] allow text-guided modifications via hybrid conditioning or attention manipulation. However, most operate on UNet backbones with limited long-range modeling. Diffusion Transformers (DiTs) [5, 12, 26] improve fidelity and flexibility, inspiring both training-based [44, 45, 56] and training-free editors [24, 51]. FlowEdit [24], in particular, uses ODE-based paths for structure-preserving transformations.
Despite this progress, existing editors lack a mechanism for determining what is sensitive or how to modify it while preserving non-private semantics. Unsafe2Safe is the first to integrate VLM-driven privacy reasoning, LLM edit generation, and modern diffusion editing into a unified, scalable anonymization pipeline for open-domain images.

3. Unsafe2Safe: A Privacy-Preserving Pipeline for Visual Data

We introduce Unsafe2Safe, an automatic pipeline that detects unsafe images and transforms them into safe versions using modern diffusion-based generative models, while preserving utility for downstream tasks.

Figure 2. Pipeline Overview. A VLM inspects the image for privacy risks. For flagged images, it generates a private caption c_priv and a public caption c_pub without sensitive details. An LLM then produces an edit instruction c_edit on how sensitive attributes should be modified. A diffusion editor uses these priors to generate a privacy-safe image while preserving scene semantics.

Recent diffusion models, such as Stable Diffusion and its variants [7, 24], have demonstrated strong capability in translating visual content across a wide range of modalities when guided by textual conditions. However, these models require an instruction k that specifies how the model should re-generate an image I. To the best of our knowledge, no dataset currently provides such edit instructions.
We therefore propose a novel approach that automatically generates captions describing the source image I and the corresponding edit instructions needed for safe regeneration. Building on these generative capabilities, our objective is to develop a diffusion-based framework that transforms privacy-prone images into privacy-safe counterparts while preserving both structural and semantic fidelity.

Unsafe2Safe processes an image dataset and anonymizes unsafe regions by replacing or modifying private content according to a safe description, while leaving non-private and task-relevant areas unchanged. To achieve this, Unsafe2Safe operates in two stages, as illustrated in Figure 2. Stage 1 contains three components: image privacy inspection, where a VLM identifies privacy-sensitive images; image captioning, where the same VLM produces both private and public captions describing the image with and without sensitive attributes; and edit instruction generation, where an LLM creates neutral, identity-free modification prompts based on the public captions. Stage 2 then applies an image editor to generate the final safe images.

3.1. Stage 1: Inspection

Image Privacy Inspection. We provide a VLM agent with a predefined set of privacy criteria and ask it to inspect each image to determine whether any criterion is present. If so, the image is marked as unsafe, and the system returns PRIVACY_FLAG=TRUE. To minimize the risk of missing private content, we deliberately allow a higher Type I error rate (false positives). The criteria are derived from the private attribute set in VISPR [34], which consolidates attributes based on both regulatory guidelines and widely accepted norms in cyberspace.

Image Captioning. Using the PRIVACY_FLAG from the inspection step, we separate the images into those containing privacy risks and those deemed safe. For each private image, the VLM generates two captions:
1. Private caption (c_priv): fully describes the scene, including private attributes.
2. Public caption (c_pub): describes the same scene while omitting all private details.

The public caption c_pub serves as a modality-aligned, privacy-preserving representation of the image and is later used as the base condition for instruction generation and safe image synthesis.

Edit Instruction Generation. Public captions c_pub describe the source image but do not provide any information about how to edit the image to produce a safe alternative. They faithfully capture the scene while omitting all private details, making them an ideal canvas for controlled attribute insertion. To generate meaningful editing guidance, we leverage an LLM to enrich the public captions with plausible pseudo-private details. The LLM is prompted to generate plausible pseudo-private attributes that could apply to the scene, without relying on a fixed predefined list, and returns an edit instruction c_edit describing how the image should be modified. We then employ an LLM to merge the two into a compact, instruction-style prompt that preserves both semantic grounding and edit-specific guidance, referring to it as the LLM text prior.

3.2. Stage 2: Safe Image Generation

3.2.1. Overall Pipeline

Stage 2 converts each unsafe image into a privacy-safe counterpart using a text-guided diffusion editor. Our pipeline is model-agnostic and supports instruction-driven editors such as InstructPix2Pix [7] and FlowEdit [24]. The goal is to produce images that satisfy three criteria: (i) sensitive attributes are neutralized or replaced, (ii) non-private semantics and spatial layout are preserved, and (iii) the result remains useful for downstream tasks. Given the public caption c_pub and the edit instruction c_edit from Stage 1, the editor is conditioned on both signals.
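The control flow of the two stages described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `inspect_image`, `caption_image`, `make_edit_instruction`, and `edit_image` are hypothetical stand-ins for the VLM, LLM, and diffusion-editor calls.

```python
# Minimal sketch of the Unsafe2Safe control flow. The four helpers below are
# hypothetical stand-ins for the VLM / LLM / diffusion-editor calls.
def inspect_image(image):
    """Stage 1a: PRIVACY_FLAG via predefined criteria (toy rule for illustration)."""
    return "face" in image["content"]

def caption_image(image, omit_private):
    """Stage 1b: private vs. public caption from the same VLM."""
    return image["public_desc"] if omit_private else image["private_desc"]

def make_edit_instruction(public_caption):
    """Stage 1c: an LLM enriches c_pub with plausible pseudo-private attributes."""
    return "Replace identifiable details so the scene reads: " + public_caption

def edit_image(image, public_caption, edit_instruction):
    """Stage 2: dual-prompt diffusion editing (c_pub anchors, c_edit changes)."""
    return {"content": public_caption, "edited": True}

def unsafe2safe(image):
    if not inspect_image(image):            # PRIVACY_FLAG = FALSE: keep as-is
        return image, None
    c_priv = caption_image(image, omit_private=False)
    c_pub = caption_image(image, omit_private=True)
    c_edit = make_edit_instruction(c_pub)
    safe = edit_image(image, c_pub, c_edit)
    return safe, (c_priv, c_pub, c_edit)    # triplet reusable for fine-tuning
```

The returned (c_priv, c_pub, c_edit) triplet has the same structure as the automatically generated triplets later used to fine-tune the diffusion editors.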
The edit instruction specifies what must change, while the public caption anchors the model to the safe, task-relevant description of the scene. However, standard diffusion editors attend overwhelmingly to the instruction, which can cause the model to overwrite non-sensitive regions and lose details important for downstream utility.

Figure 3. SafeAttention within UNet. The UNet transformer receives two textual conditions: the edit instruction and the public caption. The Cross Attention module follows the standard cross-attention pathway, while an auxiliary Safe Attention module operates on both embeddings to reinforce non-private semantics during denoising.

3.2.2. Safe Cross Attention

Modern instruction-based editing models tend to prioritize making changes rather than preserving non-sensitive content. Because the source image alone serves as the only structural anchor, these models often over-apply edits or unintentionally modify regions that should remain untouched, an undesirable behavior for privacy-preserving anonymization.

To address this limitation, we introduce Safe Cross Attention, an auxiliary attention module that conditions the denoising process jointly on the public caption c_pub and the edit instruction c_edit. An overview of the new module and its placement in the transformer framework is provided in Figure 3. Safe Cross Attention mirrors the architecture of standard cross-attention but is extended to operate on a concatenated token sequence: we combine the embeddings of c_pub and c_edit into a unified sequence, which is then projected into Key and Value tensors.
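The concatenate-then-attend step can be illustrated with a minimal NumPy sketch, assuming single-head attention and made-up dimensions; the real module operates on CLIP text embeddings inside the UNet.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def safe_cross_attention(latent, pub_tokens, edit_tokens, Wq, Wk, Wv):
    """Single-head sketch: image latents attend over [c_pub ; c_edit] tokens.

    latent      (N, d):  image tokens being denoised (queries)
    pub_tokens  (Lp, d): public-caption embeddings (semantic anchor)
    edit_tokens (Le, d): edit-instruction embeddings (targeted change)
    """
    text = np.concatenate([pub_tokens, edit_tokens], axis=0)  # concat along text dim
    q, k, v = latent @ Wq, text @ Wk, text @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))            # (N, Lp + Le)
    return attn @ v                                           # (N, d)
```

Because the Keys and Values span both caption embeddings, each image token can draw on the preserving signal and the editing signal within a single attention map.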
This design provides the model with two complementary signals during denoising: (1) semantic preservation, enforced by the public caption, and (2) targeted transformation, directed by the edit instruction. By having simultaneous access to both forms of conditioning, the model can preserve layout and non-private objects while applying strong, localized edits to sensitive regions.

4. Results

In this section, we present a comprehensive evaluation of our Unsafe2Safe pipeline across privacy, utility, cheating, and realism metrics. We compare against strong anonymization baselines, analyze trade-offs between privacy and downstream performance, and examine the effects of model variants and fine-tuning. These experiments demonstrate that Unsafe2Safe provides markedly stronger privacy protection while maintaining competitive task utility.

4.1. Datasets

Evaluation. For downstream evaluation, we use multiple benchmarks: Caltech101 [13] for object classification, MIT Indoor67 [37] for scene classification, and MS-COCO [29] for image captioning and open-ended VQA based on the labels from OK-VQA [32]. Though not originally designed for privacy analysis, many images in these datasets contain privacy-sensitive content unrelated to their annotated categories. While prior work typically uses the first two datasets only for classification, we are the first to employ them for evaluating anonymization and privacy preservation.

Training Data for Anonymization. As existing anonymization datasets are limited either in scale or in the range of identity attributes they cover, we construct our own dataset to train a generalizable privacy-preserving editor. We use MS-COCO [29] (train2014 and val2014) as a large and diverse source of images and generate an edited counterpart for each image using [30], which preserves scene structure while altering subject appearance.
We follow the recommended configuration from the original work and set the self-attention swapping probability to 0.4 to induce pronounced visual changes. Edits that fail to preserve the original scene semantics are discarded; details of this filtering procedure are provided in Appendix S1.2.

4.2. Baselines & Model Variants

Image Anonymizers. We compare our approach against two recent anonymization methods, DeepPrivacy2 [22] and FaceAnon [25], both of which detect and anonymize faces and bodies. To demonstrate the adaptability of our Unsafe2Safe pipeline, we experiment with a broad set of widely used image-editing models for performing anonymization from both UNet and DiT backbone families. For UNet editors, we evaluate FreePrompt [30] and InstructPix2Pix [7]. For DiT editors, we use FlowEdit [24], the current state of the art in image editing, built on a Stable Diffusion 3 [12] backbone. We further fine-tune InstructPix2Pix [7]. All InstructPix2Pix experiments are initialized from the pretrained MagicBrush weights [53].

Text Priors. We evaluate five types of textual priors produced in Stage 1 of our pipeline: c_priv, class, c_public, c_edit, and an LLM-composed combination of (c_edit, c_public). The class prompt provides a minimal class-level description, "A realistic image of <class>.", to anchor the global concept. The public caption c_public contains a sanitized description of the source image, while c_edit specifies where and how privacy-sensitive regions should be modified. In all experiments, we used Qwen3-4B-Instruct [52] as the LLM and InternVL2.5 as the VLM agent for text prior generation. See Appendix S6.1 for the exact prompts.

4.3. Evaluation Metrics

To comprehensively assess anonymization performance, we introduce a unified evaluation framework consisting of four metric groups: Quality, Cheating, Privacy, and Utility scores.
These metrics quantify realism, unintended information leakage, demographic obscurity, and downstream task performance, respectively.

Quality Score. We evaluate visual realism via CLIP similarity [38] between each anonymized image and its public caption. Higher alignment indicates better preservation of the original scene's global semantic concept.

Cheating Scores. To measure unintended copying of pixel-level or perceptual cues, we compute (i) structural similarity (SSIM) [49] and (ii) perceptual similarity (LPIPS) [55]. Lower SSIM and higher LPIPS indicate greater deviation from the source image. We refer to these as cheating scores because anonymizers should not rely on reconstructing original identity features to maintain realism.

Privacy Scores. We propose four privacy-focused metrics derived from VLM-based analysis:
• VLMScore (↑): A VLM receives the unsafe-safe image pair and assigns a score (0–100) indicating how effectively privacy-sensitive issues have been resolved.
• FaceSim (↓): Cosine similarity between detected faces and their closest anonymized counterparts using the Antevelop-v2 encoder [10].
• TextSim (↓): We extract text from each anonymized image via a VLM and compute a token-set ratio, where lower similarity indicates successful removal or distortion of identifiable textual content.
• Race Entropy (↑): A VLM predicts demographic attributes from the set R present in each image (White, Black, Asian, Hispanic, and/or Other), and we compute the normalized entropy:

$$P(r) = \frac{\mathrm{count}(r)}{\sum_{r' \in \mathcal{R}} \mathrm{count}(r')}, \qquad e = \frac{-\sum_{r \in \mathcal{R}} P(r) \log P(r)}{\log K},$$

where K = |R| is the number of race categories. Higher entropy indicates a more uniform, and thus less identity-specific, demographic distribution. We refer to Appendix S6.1 for each task's VLM prompts.
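The Race Entropy computation reduces to a few lines; the sketch below is illustrative, applied here to a flat list of labels rather than the per-image VLM predictions used in the paper.

```python
import math
from collections import Counter

# Illustrative normalized-entropy computation over VLM-predicted race labels.
RACES = ["White", "Black", "Asian", "Hispanic", "Other"]  # K = 5 categories

def race_entropy(predictions, num_categories=len(RACES)):
    """Normalized entropy e in [0, 1]; higher = more uniform, less identity-specific."""
    counts = Counter(predictions)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(num_categories)
```

A set of images whose detected faces are all predicted as the same race scores 0, while a uniform spread over the five categories scores 1.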
Privacy scores are reported on all training samples for Caltech101 and Indoor67, and only on training samples flagged as private by our VLM for MS-COCO.

Utility Score. To measure task usefulness, we fine-tune models on a wide range of downstream tasks, including object classification, scene classification, image captioning, and open-ended VQA. For classification, we train a classifier on each anonymized training set and report top-1 accuracy. For image captioning, we fine-tune a VLM [28] using the human-annotated MS-COCO captions and evaluate performance using BLEU-4 [35] and CIDEr [47], measuring alignment between generated captions and human references. For VQA, we also fine-tune a VLM [52] and report VQA accuracy [4]. Importantly, all utility scores are computed on the original test sets, reflecting the degree of task-relevant semantics preservation under the original data distribution rather than adaptation to anonymized data.

These four metric groups provide a holistic evaluation of anonymization quality, information leakage, demographic privacy, and practical utility, representing a key contribution of our work for assessing anonymization methods. More details on the implementation of these evaluation metrics are provided in Appendix S2.4.

Table 1. VLM privacy detector performance under different attribute flagging criteria; higher recall means less privacy leakage.

Privacy Criterion    Recall ↑   Precision ↑   F1 ↑
All attributes       0.975      0.793         0.874
Face                 0.850      0.927         0.887
Health indicators    0.892      0.678         0.770
Vehicles             0.829      0.435         0.570
Personal opinion     0.778      0.665         0.717

4.4. Main Results

Is Stage 1 reliable? We assess the reliability of the Stage 1 privacy inspection module on VISPR [34], treating the task as binary privacy detection. Since missed detections directly translate to privacy leakage in the training set, recall is the most critical metric.
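As a sanity check, the F1 column of Table 1 follows directly from the recall and precision columns as their harmonic mean, agreeing with the reported values to within rounding:

```python
def f1_score(recall, precision):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (criterion, recall, precision) rows from Table 1.
table1 = [
    ("All attributes", 0.975, 0.793),
    ("Face", 0.850, 0.927),
    ("Health indicators", 0.892, 0.678),
    ("Vehicles", 0.829, 0.435),
    ("Personal opinion", 0.778, 0.665),
]
for name, recall, precision in table1:
    print(f"{name}: F1 = {f1_score(recall, precision):.3f}")
```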
As shown in Table 1, the detector achieves consistently high recall across privacy categories, indicating that only a negligible fraction of sensitive samples remain unflagged. Implementation details and attribute group definitions are provided in Appendix S2.3.

Does the Model Generate Realistic and Properly Anonymized Images? We examine whether Unsafe2Safe produces visually coherent edits while effectively removing privacy-sensitive attributes. Figure 4 shows qualitative results comparing prior privacy-preserving methods and our pipeline. All models preserve the global layout and structure of the original scene while replacing the identities of the people in the image. Importantly, Unsafe2Safe also anonymizes background elements and non-human objects (e.g., posters and ads on the board) whenever they contain privacy-unsafe cues, an ability largely absent in prior approaches, which tend to modify only facial regions.

Figure 4. Qualitative comparison of anonymization outputs on Caltech101. Each column shows a different model family (top line) and its textual condition (bottom line). All methods preserve the global layout of the original scene, but Unsafe2Safe using FlowEdit [24], unlike face-only anonymizers (DP2, FaceAnon), modifies background elements when they contain privacy-relevant cues, while keeping overall scene composition intact.

Table 2. Quality and Cheating scores for Caltech101 (Cal101) and MIT Indoor67 (I67). Our Unsafe2Safe (U2S), which uses FlowEdit (FE) [24] as the underlying image editor, generates high-quality images comparable to other anonymization models. Unlike prior methods that rarely modify content beyond facial regions, U2S identifies privacy-unsafe cues throughout the entire image and performs broader, context-aware edits. Best values are highlighted in yellow, and second-best in light blue.
                     Quality Score: CLIP (↑)   Cheating Scores: SSIM (↓)   LPIPS (↑)
Model      Prior     C101     I67              C101     I67                C101     I67
U2S (FE)   c_priv    0.3054   0.3411           0.8551   0.6497             0.1118   0.2815
U2S (FE)   class     0.2634   0.3149           0.8559   0.6708             0.1139   0.2600
U2S (FE)   c_pub     0.3094   0.3448           0.8556   0.6497             0.1118   0.2803
U2S (FE)   c_edit    0.2685   0.2946           0.8550   0.6580             0.1168   0.2846
U2S (FE)   LLM       0.2865   0.3251           0.8551   0.6484             0.1168   0.2896
DP2        –         0.3023   0.3512           0.9817   0.9608             0.0161   0.0365
FaceAnon   –         0.3057   0.3452           0.9888   0.9443             0.0104   0.0812

This observation is corroborated quantitatively in Table 2. CLIP scores remain comparable across methods, indicating similar perceptual realism. Yet, Unsafe2Safe consistently yields lower SSIM and higher LPIPS values, revealing stronger privacy-preserving transformations without compromising global scene semantics. The margins of change are smaller on Caltech101 due to its higher proportion of publicly available, non-sensitive images.

Trade-Off Between Utility and Privacy. To evaluate how much privacy protection can be achieved without sacrificing downstream performance, we analyze the trade-off between utility, measured by classification accuracy, captioning scores (BLEU-4 and CIDEr), and VQA accuracy, and privacy, assessed using VLMScore, FaceSim, TextSim, and Race Entropy.

Table 3. Utility and privacy comparison on classification tasks. Our Unsafe2Safe, which uses FlowEdit (FE) [24] as the underlying image editor, achieves comparable performance on downstream tasks while successfully anonymizing privacy-sensitive information, unlike other anonymization models. The best value (yellow) and the second-best value (blue) are highlighted per column.
                                             Accuracy (↑)      VLMScore (↑)      FaceSim (↓)       TextSim (↓)       Race Entropy (↑)
Model                Text Prior              Cal101   Indoor   Cal101   Indoor   Cal101   Indoor   Cal101   Indoor   Cal101   Indoor
Raw Images           –                       94.277   83.881    7.700    0.763   1.0000   1.0000   1.0000   1.0000   0.4384   0.7443
Unsafe2Safe (FE)     c_private               94.334   79.925    9.646    8.009   0.4378   0.2666   0.6611   0.4210   0.5552   0.7399
Unsafe2Safe (FE)     class                   93.857   80.448   12.555    7.512   0.4965   0.3743   0.4856   0.2077   0.4051   0.6508
Unsafe2Safe (FE)     c_public                94.487   80.746    9.873    7.440   0.4881   0.2883   0.5395   0.2896   0.6409   0.7208
Unsafe2Safe (FE)     c_edit                  94.792   77.090   13.966   21.390   0.3658   0.2077   0.5238   0.2393   0.7646   0.7589
Unsafe2Safe (FE)     LLM (c_edit, c_public)  92.884   80.746   12.695   14.937   0.3428   0.2294   0.4881   0.2119   0.8751   0.7643
DeepPrivacy2 [22]    –                       94.601   84.030   11.053    0.767   0.3921   0.3547   0.9569   0.8653   0.7315   0.7547
FaceAnonSimple [25]  –                       94.849   84.030    8.757    1.233   0.4586   0.5045   0.9355   0.7701   0.6091   0.7407

Table 4. Utility and privacy comparison on MS-COCO captioning. U2S achieves higher VLMScore and much lower FaceSim/TextSim. Our Unsafe2Safe uses FreePrompt (FP) [30] and FlowEdit (FE) [24] as the underlying image editors.

Model      Text Prior   BLEU-4 ↑   CIDEr ↑   VLMScore ↑   FaceSim ↓   TextSim ↓   RaceEnt ↑
Raw        –            0.436      1.413      0.646       1.0000      1.0000      0.6170
U2S (FP)   c_edit       0.433      1.390     32.002       0.2013      0.1442      0.7364
U2S (FE)   LLM          0.423      1.363     36.641       0.1975      0.1276      0.7282
FaceAnon   –            0.444      1.429      1.622       0.4436      0.8123      0.6458

Table 5. Utility and privacy comparison on OK-VQA. U2S attains the highest VQA accuracy while achieving the strongest privacy scores. Best (yellow) values are highlighted per column.

Model      Text Prior   VQA Acc (↑)   VLMScore (↑)   FaceSim (↓)   TextSim (↓)   RaceEnt (↑)
Raw        –            0.6064         0.600         1.0000        1.0000        0.6288
U2S (FP)   c_edit       0.6573        33.192         0.2041        0.1514        0.7499
U2S (FE)   LLM          0.7093        37.059         0.1951        0.1196        0.7397
FaceAnon   –            0.6632         1.705         0.4483        0.8158        0.6568

Utility.
As shown in Table 3, classifiers trained on U2S-anonymized datasets achieve performance close to models trained on the original images, with differences as small as 0.5 points on Caltech101 and 4.6 points on Indoor67. A larger drop occurs only when anonymization relies on FlowEdit with the edit instruction, likely due to weaker alignment between the instruction formulation and dataset captions.

A similar pattern appears in generative tasks. On MS-COCO captioning (Table 4), BLEU-4 and CIDEr remain near the raw baseline despite strong privacy transformations. On OK-VQA (Table 5), models trained on U2S-generated images achieve accuracy comparable to, and in our experiments slightly exceeding, those trained on raw or face-anonymized data. These results indicate that the anonymization process preserves the task-relevant semantics required for language grounding and reasoning.

Table 6. Race distribution (%) of images produced by different models.

Model              %White   %Black   %Asian   %Hispanic   %Other
Raw Images         80.28    2.82     5.63     4.23        7.04
U2S (FE, c_edit)   45.93    29.63    13.33    0.00        11.11
U2S (FE, LLM)      37.90    25.81    17.74    2.42        16.13
DP2 [22]           56.70    3.09     19.59    4.12        16.49
FaceAnon [25]      70.67    4.00     10.67    4.00        10.67

Privacy. Unsafe2Safe (U2S) consistently delivers stronger privacy protection across all metrics. It achieves substantially higher VLMScore, and thus stronger removal of instance-level identity cues, as evaluated by the VLM judge. This suggests that anonymization is not limited to pixel-level modifications but effectively disrupts semantic identity linkage. Across all datasets, U2S significantly reduces face similarity compared to existing anonymizers, by up to 5 points on Caltech101, 15 points on Indoor67, and 24 points on the captioning and VQA training sets. TextSim values are likewise markedly lower, showing that identifiable textual content is frequently altered or rendered unreadable.
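The TextSim comparison above is computed, per the supplementary, with a token-set similarity between text extracted from the original and anonymized images (the actual implementation uses rapidfuzz's `token_set_ratio`). The helper below is a simplified Jaccard-style stand-in with the same qualitative behavior, not our exact metric:

```python
def token_set_similarity(text_orig: str, text_anon: str) -> float:
    """Simplified token-set similarity between text read off the
    original and anonymized images (1.0 = identical token sets,
    0.0 = no overlap). Hedged stand-in for rapidfuzz.token_set_ratio."""
    a = set(text_orig.lower().split())
    b = set(text_anon.lower().split())
    if not a and not b:
        return 1.0  # no readable text in either image
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

An unedited street sign scores 1.0 against itself, while a sign rewritten into unrelated text scores 0.0, matching the drop in TextSim reported for U2S.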
Although Caltech101 exhibits a highly skewed demographic distribution, U2S introduces noticeably higher race entropy (i.e., greater diversity). Table 6 provides the breakdown of race distributions for all models. Additional experiments demonstrate that demographic attributes can be further controlled through textual conditioning (Appendix S4.3). Overall, Unsafe2Safe preserves downstream performance while providing stronger privacy protection.

Table 7. Performance of InstructPix2Pix [7] with different text priors, and after fine-tuning on our anonymization dataset (✓). Fine-tuning notably improves both utility and privacy: the fine-tuned c_edit model achieves the highest Caltech101 accuracy (95.116%) among all variants, while also reducing FaceSim/TextSim on Indoor67 and increasing race entropy, indicating more diverse and less identity-specific outputs. "Safe" denotes the model trained with Safe Attention.

Finetuned   Text Prior               Accuracy ↑ (Cal101 / Indoor)   VLMScore ↑        FaceSim ↓         TextSim ↓         Race Entropy ↑
            c_private                94.773 / 80.746                11.651 / 6.559    0.5571 / 0.3382   0.6553 / 0.3112   0.5642 / 0.7414
            class                    94.887 / 79.776                10.674 / 5.943    0.6888 / 0.3500   0.6176 / 0.2898   0.5528 / 0.7678
            c_public                 94.582 / 81.791                10.814 / 5.646    0.5401 / 0.3389   0.6589 / 0.3044   0.6497 / 0.7599
            c_edit                   94.315 / 81.418                16.878 / 16.926   0.5164 / 0.2909   0.6126 / 0.3399   0.6826 / 0.7493
            LLM (c_edit, c_public)   93.762 / 81.119                13.699 / 12.754   0.4792 / 0.2857   0.6455 / 0.3112   0.7746 / 0.8090
✓           c_edit                   95.116 / 81.716                12.595 / 16.250   0.5910 / 0.2735   0.6176 / 0.2640   0.7997 / 0.8356
✓           LLM (c_edit, c_public)   94.601 / 81.493                12.145 / 10.804   0.5150 / 0.2970   0.6464 / 0.3090   0.8308 / 0.8090
✓ (Safe)    c_edit, c_public         94.887 / 80.149                13.366 / 17.801   0.5468 / 0.2465   0.5966 / 0.2660   0.8306 / 0.8448

(Figure 5 panels, left to right: Original, Unsafe2Safe, Unsafe2Safe (FT).)
Figure 5.
Qualitative examples showing that the InstructPix2Pix model finetuned on our dataset (FT) more effectively anonymizes sensitive content while preserving original class semantics, compared to the model trained on general editing data.

Is Fine-Tuning and SafeAttention Helpful? We further evaluate whether fine-tuning InstructPix2Pix on our Unsafe2Safe dataset improves utility and privacy performance. Table 7 reports quantitative results before and after fine-tuning, and Figure 5 shows representative examples. Under the same c_edit prior, the fine-tuned model achieves more effective anonymization than its non-fine-tuned counterpart, better removing facial identity cues while preserving scene composition (e.g., the bar layout). In the top example, the non-fine-tuned Unsafe2Safe output leaves residual identity cues in the face. This observation is consistent with the general-purpose design of InstructPix2Pix and highlights the benefit of task-specific fine-tuning on our dataset.

Fine-tuning provides clear benefits. The fine-tuned c_edit model achieves the highest Caltech101 accuracy (95.116%) among all InstructPix2Pix variants, while also improving privacy metrics such as FaceSim and TextSim on Indoor67. Notably, the fine-tuned models also exhibit stronger race entropy, indicating a more diverse and less identity-specific demographic distribution compared to their non-finetuned counterparts. These results show that fine-tuning helps adapt the editor to the anonymization task, improving both utility and privacy outcomes.

The SafeAttention variant further improves anonymization stability. Compared to the non-finetuned and standard finetuned models, the SafeAttention model achieves the strongest overall privacy protection, including the lowest Indoor FaceSim and the highest Caltech and Indoor race entropy, as well as the highest Indoor VLMScore.
These gains come with no noticeable loss in utility, demonstrating that SafeAttention helps guide the editor toward safer and more demographically neutral outputs. See Appendix S4.2 for further analysis of the attention maps and the guidance provided by c_public.

5. Conclusion

We presented Unsafe2Safe, a fully automated framework for transforming privacy-prone images into privacy-safe yet semantically faithful counterparts. By combining VLM-based inspection, public/private captioning, and LLM-generated edit instructions with diffusion-based editors, the system rewrites sensitive regions while preserving global structure and task-relevant semantics. Privacy criteria can be specified via textual prompts, and components can be instantiated with different VLMs and LLMs.

We introduced a unified evaluation suite spanning image quality, leakage, privacy attributes, and downstream utility, enabling holistic assessment of anonymization methods. Experiments show that Unsafe2Safe significantly reduces identity leakage while maintaining downstream accuracy, supporting scalable privacy-aware dataset construction.

Finally, Unsafe2Safe is a configurable dataset-construction tool, not an autonomous arbiter of privacy. Responsibility for defining privacy criteria and deployment constraints rests with practitioners.

Acknowledgments

This work was supported by startup funds provided by Dartmouth College.

References

[1] Sara Abdulaziz, Giacomo D'amicantonio, and Egor Bondarev. Evaluation of human visual privacy protection: Three-dimensional framework and benchmark dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5893–5902, 2025.
[2] Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv preprint arXiv:1906.02569, 2019.
[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint, 2023.
[4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2015.
[5] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, pages arXiv–2506, 2025.
[6] Gibran Benitez-Garcia, Jesus Olivares-Mercado, Gabriel Sanchez-Perez, and Keiji Yanai. IPN Hand: A video dataset and benchmark for real-time continuous hand gesture recognition. In International Conference on Pattern Recognition (ICPR), pages 4340–4347. IEEE, 2021.
[7] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2023.
[8] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
[9] Ishan Rajendrakumar Dave, Chen Chen, and Mubarak Shah. SPAct: Self-supervised privacy preservation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20164–20173, 2022.
[10] deepinsight. GitHub - deepinsight/insightface: State-of-the-art 2D and 3D face analysis project, 2023.
[11] Anil Egin, Andrea Tangherloni, and Antitza Dantcheva. Now you see me, now you don't: A unified framework for expression-consistent anonymization in talking head videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5925–5934, 2025.
[12] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the International Conference on Machine Learning (ICML), pages 12606–12633. PMLR, 2024.
[13] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Workshop on Generative-Model Based Vision, pages 178–178. IEEE, 2004.
[14] Joseph Fioresi, Ishan Rajendrakumar Dave, and Mubarak Shah. TeD-SPAD: Temporal distinctiveness for self-supervised privacy-preservation for video anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13598–13609, 2023.
[15] Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, and Tao Jiang. Membership inference attacks against fine-tuned large language models via self-prompt calibration. Advances in Neural Information Processing Systems (NeurIPS), 37:134981–135010, 2024.
[16] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[17] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022.
[18] Ditmar Hadera, Jan Cech, Miroslav Purkrabek, and Matej Hoffmann. Blanket: Anonymizing faces in infant video recordings. In IEEE International Conference on Development and Learning (ICDL), pages 1–8. IEEE, 2025.
[19] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.
[20] Li Hu. Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8153–8163, 2024.
[21] Håkon Hukkelås, Rudolf Mester, and Frank Lindseth. DeepPrivacy: A generative adversarial network for face anonymization. In International Symposium on Visual Computing (ISVC), pages 565–578. Springer, 2019.
[22] Håkon Hukkelås and Frank Lindseth. DeepPrivacy2: Towards realistic full-body anonymization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1329–1338, 2023.
[23] Filip Ilic, He Zhao, Thomas Pock, and Richard P. Wildes. Selective interpretable and motion consistent privacy attribute obfuscation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18730–18739, 2024.
[24] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli.
FlowEdit: Inversion-free text-based editing using pre-trained flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19721–19730, 2025.
[25] Han-Wei Kung, Tuomas Varanka, Sanjay Saha, Terence Sim, and Nicu Sebe. Face anonymization made simple. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1040–1050. IEEE, 2025.
[26] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
[27] Yebin Lee, Imseong Park, and Myungjoo Kang. FLEUR: An explainable reference-free evaluation metric for image captioning using a large multimodal model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3732–3746, Bangkok, Thailand, 2024. Association for Computational Linguistics.
[28] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning (ICML), pages 19730–19742. PMLR, 2023.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
[30] Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, and Jun Huang. Towards understanding cross and self-attention in Stable Diffusion for text-guided image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7817–7826, 2024.
[31] Zhe Ma, Qingming Li, Xuhong Zhang, Tianyu Du, Ruixiao Lin, Zonghui Wang, Shouling Ji, and Wenzhi Chen.
An inversion-based measure of memorization for diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16959–16969, 2025.
[32] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[33] Jeffri Murrugarra-Llerena, Niu Haoran, Suzanne K. Barber, Hal Daumé III, Yang Trista Cao, and Paola Cascante-Bonilla. Beyond blanket masking: Examining granularity for privacy protection in images captured by blind and low vision users. Conference on Language Modeling (COLM), 2025.
[34] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Towards a visual privacy advisor: Understanding and predicting privacy risks in images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
[35] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, USA, 2002.
[36] Kartik Patwari, David Schneider, Xiaoxiao Sun, Chen-Nee Chuah, Lingjuan Lyu, and Vivek Sharma. Rendering-refined stable diffusion for privacy compliant synthetic data. arXiv preprint arXiv:2412.06248, 2024.
[37] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 413–420. IEEE, 2009.
[38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision.
In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
[39] Daniel Samira, Edan Habler, Yuval Elovici, and Asaf Shabtai. Variance-based membership inference attacks against large-scale image captioning models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9210–9219, 2025.
[40] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Proceedings of the International Conference on Pattern Recognition (ICPR), pages 32–36 Vol. 3, 2004.
[41] Junjie Shan, Ziqi Zhao, Jialin Lu, Rui Zhang, Siu Ming Yiu, and Ka-Ho Chow. Geminio: Language-guided gradient inversion attacks in federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2718–2727, 2025.
[42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[43] Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev. Beyond memorization: Violating privacy via inference with large language models. arXiv preprint arXiv:2310.07298, 2023.
[44] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. OminiControl: Minimal and universal control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14940–14950, 2025.
[45] Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, and Xinchao Wang. OminiControl2: Efficient conditioning for diffusion transformers. arXiv preprint, 2025.
[46] Batuhan Tömekçe, Mark Vero, Robin Staab, and Martin Vechev. Private attribute inference from images with vision-language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[47] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh.
CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[48] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[49] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
[50] Boyi Wei, Weijia Shi, Yangsibo Huang, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, and Peter Henderson. Evaluating copyright takedown methods for language models. Advances in Neural Information Processing Systems (NeurIPS), 37:139114–139150, 2024.
[51] Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. EEdit: Rethinking the spatial and temporal redundancy for efficient image editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17474–17484, 2025.
[52] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[53] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems (NeurIPS), 36:31428–31449, 2023.
[54] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023.
[55] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang.
The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[56] Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. EasyControl: Adding efficient and flexible control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19513–19524, 2025.
[57] Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based fine-grained image editing at scale. In Advances in Neural Information Processing Systems (NeurIPS), pages 3058–3093. Curran Associates, Inc., 2024.

Unsafe2Safe: Controllable Image Anonymization for Downstream Utility (Supplementary Material)

S1. Dataset Construction and Preprocessing

Data usage. All images are sourced from publicly available datasets (e.g., Caltech101 [13], MIT Indoor Scenes [37], and MS-COCO [29]) and used in accordance with their respective research usage policies.

S1.1. Dataset Choice

We select MS-COCO, Caltech101, and MIT Indoor67 because they provide standardized downstream utility labels while containing diverse real-world identity and contextual cues that require whole-image anonymization beyond facial regions.

Many commonly used computer vision datasets, such as CIFAR or iNaturalist, contain limited privacy-sensitive content and therefore do not meaningfully test anonymization methods. In contrast, datasets used by prior anonymization works [22, 25] (e.g., CelebA-HQ, FFHQ) primarily evaluate re-identification rates and focus heavily on faces. While these benchmarks are suitable for measuring facial identity removal, they lack downstream task labels, making it difficult to assess the utility–privacy trade-off.
Web-scale datasets such as LAION or Flickr are highly representative of privacy-prone internet imagery and often contain caption annotations. However, their scale makes large-scale controlled anonymization experiments computationally prohibitive, particularly when multi-stage editing and evaluation are required.

We refrain from using ImageNet because it is likely included in the pretraining data of the VLMs, diffusion models, and utility backbones used in our pipeline. Such overlap could confound evaluation by introducing unintended memorization or distribution leakage. Our chosen datasets allow controlled evaluation of anonymization quality while minimizing potential pretraining bias.

S1.2. Filtering of MS-COCO for Content Preservation

Figure S1. Distribution of normalized CLIP scores for private images after anonymization. Most samples cluster around 0.9, indicating strong preservation of original semantics. The red line marks the filtering threshold at 0.7, below which edited samples are discarded during editing-model training.

Because the editing step may sometimes produce results that deviate from the original scene semantics, we apply a CLIP-based filtering step to retain only high-quality edited images. For each edited image x', we compute its CLIP similarity to the corresponding public caption and normalize it by the CLIP similarity between the original private image x and the same caption:

    s_norm = CLIP(c_public, x') / CLIP(c_public, x).    (S1)

This normalized score measures semantic preservation relative to the original image-caption alignment.
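The filtering rule built on Eq. (S1) amounts to a one-line check. The sketch below abstracts CLIP into precomputed similarity scores (the `sim_public_*` argument names are our own); only the normalization and the 0.7 threshold come from the text:

```python
def normalized_clip_score(sim_public_edited: float,
                          sim_public_original: float) -> float:
    """Eq. (S1): s_norm = CLIP(c_public, x') / CLIP(c_public, x),
    i.e., the edited image's alignment with the public caption,
    relative to the original image's alignment."""
    return sim_public_edited / sim_public_original

def keep_edited_image(sim_public_edited: float,
                      sim_public_original: float,
                      threshold: float = 0.7) -> bool:
    """Retain the edit only if it keeps more than 70% of the original
    image-caption alignment (threshold marked in Figure S1)."""
    return normalized_clip_score(sim_public_edited, sim_public_original) > threshold
```

For example, an edit whose caption similarity drops from 0.30 to 0.27 scores s_norm = 0.9 and is kept, while a drop to 0.15 scores 0.5 and is discarded.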
A value close to 1 indicates that the edited image remains as semantically aligned with the public caption as the original image, whereas lower values suggest semantic degradation or unintended content changes.

Figure S1 shows that the majority of edited samples cluster tightly around 0.9, demonstrating that our anonymization procedure largely preserves global semantics. The small left tail corresponds to failure cases where the edit significantly alters scene composition or weakens alignment with the public caption. Such samples can introduce noisy supervision signals during training, potentially harming both downstream utility and generative stability.

We retain only edited images with s_norm > 0.7. As shown in the distribution, this threshold falls in the lower tail and removes only a small fraction of samples (approximately 7.35% of the original train2014 split, reducing it from 51,401 to 47,623 images). As a result, semantically degraded edits are excluded while the vast majority of anonymized data is preserved. This filtering step improves the consistency of the training signal without meaningfully reducing dataset diversity, thereby balancing semantic fidelity and robustness in the anonymized training set. Notably, the resulting 47,623 anonymized image pairs constitute one of the largest publicly constructed before-and-after datasets for image editing [53] and, to our knowledge, the largest that is explicitly privacy-aware.

S2. Implementation Details

S2.1. Language models

Throughout all experiments, we used InternVL2.5 [8] as the VLM and Qwen3-4B in non-thinking mode (Qwen3-4B-Instruct-2507) [52] as the LLM. This choice is mainly empirical, favoring a fast but robust model. After obtaining the answers from the model, we automatically parse the sections following the structure in the prompt.

S2.2.
Diffusion model training and generation

For FreePrompt [30], following the official implementation, we used an empty string as the source prompt and set the SELF_REPLACE_STEPS ratio to 0.4 to encourage substantial appearance changes. Images were generated at a resolution of 512×512 using 50 diffusion steps with a guidance scale of 7.5.

For FlowEdit [24], we adopted the SD3 backbone with the default hyperparameters provided in the official release. Across all FlowEdit-based experiments, we used c_priv as the caption describing the source image.

For InstructPix2Pix [7], we trained on 4 GPUs with a batch size of 64 and a learning rate of 1×10^-5. Training was conducted for 200 epochs with gradient accumulation to an effective batch size of 256. We initialized all modules with MagicBrush [53] weights and updated only the UNet parameters. Following the original pipeline, all training images were resized to 256×256, while inference was performed at 512×512 using classifier-free guidance scales of 1.5 for the image and 7.5 for the text prompt, with 100 denoising steps.

We used an identical training and inference setup for the modified InstructPix2Pix equipped with Safe Attention. For initialization, the MagicBrush cross-attention weights were copied into the query, key, value, and output projection layers of the Safe Attention module to ensure compatibility and stable convergence.

For OminiControl [44], we adopted the subject configuration with a batch size of 4. We used a dummy positional offset of (0, 0) to disable spatial displacement and trained for 12,000 intervals directly from the FLUX.1-dev [26] checkpoint. To the best of our knowledge, this makes our work among the first to introduce a privacy-aware, FLUX-based image editing model.

For DeepPrivacy2 [22] and FaceAnon [25], we directly applied the default implementations released by the authors to our dataset without modification.

S2.3.
Evaluation metrics for Stage 1

VISPR attribute grouping. VISPR [34] provides privacy annotations using attribute identifiers a_i. In our evaluation, privacy inspection is treated as a binary classification task: an image is labeled safe if it has attribute a0_safe, and unsafe otherwise. To analyze performance under specific privacy risks, we further evaluate several attribute groups:
• Face: a9, a10 (complete and partial face)
• Health Indicators: a39, a41, a43 (physical disability, injury, medicine)
• Vehicles: a102, a103, a104 (vehicle ownership, license plate complete/partial)
• Personal Opinion: a61, a62 (general and political opinions)
An image is considered privacy-sensitive for a given group if any attribute in the corresponding set is present.

Evaluation protocol. For each image, the VLM detector predicts whether privacy-sensitive content is present under the specified criterion. Predictions are compared against VISPR annotations, and we report recall, precision, and F1 score. Recall is emphasized because missed detections would allow sensitive images to enter the training pipeline. All results are computed on the VISPR test split.

S2.4. Evaluation metrics for Stage 2

S2.4.1. Cheating Scores

To measure unintended preservation of perceptual cues from the original image, we compute the Learned Perceptual Image Patch Similarity (LPIPS) [55] with a VGG-16 backbone [42]. Higher LPIPS indicates stronger deviation from the source image, suggesting that the anonymization process does not simply reconstruct or copy identity-related features.

S2.4.2. Privacy Scores

Face similarity (FaceSim) is computed using the AntelopeV2 face encoder [10]. For each detected face in the original image, we compute cosine similarity to its nearest counterpart in the anonymized image and retain only the closest match to avoid bias from unmatched pairs.
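A minimal sketch of two Stage-2 privacy scores: the FaceSim matching rule described above (with the face encoder abstracted into precomputed embedding arrays) and the normalized race entropy reported throughout our tables. The mean aggregation over original faces and the log-K normalization are our assumptions about details not spelled out in this excerpt:

```python
import math
from collections import Counter

import numpy as np

def face_sim(orig_embs: np.ndarray, anon_embs: np.ndarray) -> float:
    """For each face embedding from the original image (rows of an
    (n, d) array), keep the cosine similarity to its closest
    counterpart among the anonymized image's faces, then average
    (assumed aggregation). Lower is better for privacy."""
    a = orig_embs / np.linalg.norm(orig_embs, axis=1, keepdims=True)
    b = anon_embs / np.linalg.norm(anon_embs, axis=1, keepdims=True)
    pairwise = a @ b.T                         # (n, m) cosine similarities
    return float(pairwise.max(axis=1).mean())  # nearest match per face

def race_entropy(predicted_races: list) -> float:
    """Normalized Shannon entropy of the predicted race distribution
    over the five classes used in the paper: 1.0 is uniform
    (maximally diverse), 0.0 is a single class."""
    classes = ("White", "Black", "Asian", "Hispanic", "Other")
    counts, n = Counter(predicted_races), len(predicted_races)
    probs = [counts[c] / n for c in classes if counts[c] > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(classes))
```

An anonymizer that maps every face to a visually distinct identity drives `face_sim` toward 0, while a uniform demographic output drives `race_entropy` toward 1.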
To evaluate textual leakage (TextSim), we use InternVL2.5-8B [8] to extract text from anonymized images and compute the token-set similarity using the rapidfuzz package. Lower similarity indicates more effective removal or distortion of identifiable textual content.

For demographic analysis (Race Entropy), the same VLM predicts demographic attributes (White, Black, Asian, Hispanic, Other), from which we compute the race entropy metric described in Section 4.3.

To obtain the VLMScore, we employ InternVL3.5-8B [48] as a judge model. The raw and anonymized images are jointly provided to the VLM with a structured prompt asking it to assign a score from 0–100 reflecting how effectively privacy-sensitive attributes have been removed while preserving scene semantics.

Table S1. Summary of dataset statistics, utility accuracy, and privacy-leakage indicators for the raw and safe subsets. The table reports training/validation sample counts, top-1 accuracy, and the number of detected text instances, faces, and race attributes.

Dataset             Split         Train   Val    Acc@1   Text   Faces   Races
Caltech101 [13]     Original      2000    980    94.27   251    82      71
                    Safe subset   1552    756    83.49   111    24      0
MIT Indoor67 [37]   Original      3991    1317   83.88   551    2184    1332
                    Safe subset   1605    512    51.12   84     78      1

S2.4.3. Utility Score

Classification. For classification experiments, we adopt Masked Autoencoders (ImageMAE) [19] as the backbone and fine-tune a randomly initialized linear classification head. All models are initialized from ImageNet-pretrained weights. Training uses a batch size of 64, a learning rate of 5×10^-4, gradient accumulation of 4 (effective batch size 256), and 100 epochs. Data augmentation follows the ImageMAE setup using RandAugment with parameters (n = 2, m = 9, mstd = 0.5).

Image Captioning. To evaluate semantic preservation in generative tasks, we fine-tune BLIP-2 [28] on the filtered MS-COCO dataset using the human-annotated captions.
Caption quality is evaluated using BLEU-4 and CIDEr.

Visual Question Answering. For VQA, we fine-tune a Qwen3VL-2B [52] model on question–answer pairs from the OK-VQA dataset [32] and report answer accuracy.

In all tasks, model selection is performed on the anonymized validation set constructed in the same manner as the anonymized training set. Final performance is reported on the original test sets to measure preservation of task-relevant semantics under the original data distribution. All experiments are conducted on NVIDIA A100 GPUs.

S3. Evaluation of Stage 1: Privacy Inspection

S3.1. Is Image Privacy Anonymization Necessary?

Given the images categorized into safe and unsafe partitions, we need a robust anonymization pipeline for the unsafe images while keeping the safe images intact. We evaluate the extreme setting in which only images flagged as safe by the VLM are used for training. This subset inevitably contains some false positives (images incorrectly flagged as safe), which allows us to assess how well the model performs without any anonymization applied to unsafe images.

Table S2. Alignment to MS-COCO annotations of different text priors produced by Stage 1 under the FLEUR [27] metric (higher is better). The public caption c_pub maintains FLEUR scores close to the private caption c_priv, indicating strong preservation of global scene semantics despite removing sensitive details. The LLM prior, while lower due to the introduction of additional synthesized attributes, still retains meaningful semantic alignment.

| Text Prior | c_priv | c_pub | LLM |
|---|---|---|---|
| FLEUR (↑) | 80.45 | 78.93 | 58.27 |

Table S1 summarizes the key statistics. As expected, compared to the model trained on the full (original) dataset, training solely on the safe subset leads to a substantial drop in downstream utility due to the significant reduction in data volume and diversity.
The remaining samples are heavily sanitized and lack many of the visual cues needed for effective classification, resulting in noticeably weaker accuracy. However, the privacy-leakage signals are correspondingly minimal: the VLM detects very few readable texts, faces, or identifiable racial attributes in this safe-only set. This confirms that the privacy detector is conservative and effective, as most privacy-sensitive images are successfully excluded. At the same time, the sharp utility degradation highlights the necessity of anonymizing unsafe images rather than discarding them, which helps the model recover both dataset scale and semantic richness while maintaining strong privacy guarantees.

S3.2. Are Captions Generated by Our Pipeline Helpful for Utility and Quality Preservation?

We provide an evaluation of the captions produced in Stage 1. Although modern VLMs are generally strong captioners, it is important to quantify how well the generated captions align with the underlying visual content, especially when private details are removed or rewritten. To assess caption fidelity, we use the FLEUR benchmark [27], which measures consistency between the generated caption, the image, and the five human-annotated reference captions from MS-COCO [29]. We use the val2014 split, which is also our test set, to obtain the quality scores for c_priv, c_pub, and LLM.

As shown in Table S2, the public caption c_pub exhibits only a modest decrease in FLEUR compared to the private caption c_priv. This comparable score indicates that the public captions still capture the essential scene semantics that humans perceive while successfully omitting privacy-sensitive content. Notably, the LLM-composed captions also maintain reasonably high alignment despite introducing synthetic, privacy-safe attributes, demonstrating that our edit-instruction generation preserves global scene meaning even when enriching the caption with additional identity-neutral details.

Table S3. Extended utility and privacy comparison for different generative backbones. Our Unsafe2Safe, which uses FlowEdit [24] as the underlying image editor, achieves comparable performance on downstream tasks while successfully anonymizing privacy-sensitive information, unlike other anonymization models. The best value (yellow) and the second-best value (blue) are highlighted per column. † denotes a model fine-tuned on our dataset.

| Model | Text Prior | Acc@1 (↑) Cal101 | Acc@1 (↑) Indoor | FaceSim (↓) Cal101 | FaceSim (↓) Indoor | TextSim (↓) Cal101 | TextSim (↓) Indoor | RaceEntropy (↑) Cal101 | RaceEntropy (↑) Indoor |
|---|---|---|---|---|---|---|---|---|---|
| Raw Images | – | 94.277 | 83.881 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.4384 | 0.7443 |
| Unsafe2Safe (FlowEdit [24]) | c_private | 94.334 | 79.925 | 0.4378 | 0.2666 | 0.6611 | 0.4210 | 0.5552 | 0.7399 |
| | class | 93.857 | 80.448 | 0.4965 | 0.3743 | 0.4856 | 0.2077 | 0.4051 | 0.6508 |
| | c_public | 94.487 | 80.746 | 0.4881 | 0.2883 | 0.5395 | 0.2896 | 0.6409 | 0.7208 |
| | c_edit | 94.792 | 77.090 | 0.3658 | 0.2077 | 0.5238 | 0.2393 | 0.7646 | 0.7589 |
| | LLM (c_edit, c_public) | 92.884 | 80.746 | 0.3428 | 0.2294 | 0.4881 | 0.2119 | 0.8751 | 0.7643 |
| Unsafe2Safe (FreePrompt [30]) | c_private | 94.926 | 81.567 | 0.5026 | 0.2764 | 0.6382 | 0.2757 | 0.5421 | 0.6963 |
| | class | 94.506 | 79.254 | 0.5474 | 0.2966 | 0.5651 | 0.2314 | 0.4991 | 0.5960 |
| | c_public | 94.849 | 82.537 | 0.5085 | 0.2811 | 0.5949 | 0.2423 | 0.5790 | 0.6899 |
| | c_edit | 93.857 | 78.881 | 0.3693 | 0.2165 | 0.5672 | 0.2361 | 0.7516 | 0.8276 |
| | LLM (c_edit, c_public) | 94.105 | 80.522 | 0.4539 | 0.2159 | 0.5580 | 0.2217 | 0.7967 | 0.7806 |
| Unsafe2Safe (OminiControl [44])† | c_edit | 94.506 | 80.000 | 0.3585 | 0.1925 | 0.6132 | 0.3350 | 0.8425 | 0.8058 |
| DeepPrivacy2 [22] | – | 94.601 | 84.030 | 0.3921 | 0.3547 | 0.9569 | 0.8653 | 0.7315 | 0.7547 |
| FaceAnonSimple [25] | – | 94.849 | 84.030 | 0.4586 | 0.5045 | 0.9355 | 0.7701 | 0.6091 | 0.7407 |

S4. Additional Results of Stage 2: Safe Image Generation

S4.1. Quantitative results

In Table S3, we report the performance of our pipeline when leveraging FreePrompt [30] and OminiControl [44] as the editing diffusion models.
These results strengthen our conclusion that Unsafe2Safe provides a robust framework not only for obtaining privacy-preserving edit instructions, but also for curating an effective dataset for teaching a diffusion model to perform anonymization.

S4.2. Analysis of attention maps from SafeAttention

We further analyze the resulting attention maps to visualize how SafeAttention affects the editing behavior. Following Liu et al. [30], we compute the averaged attention maps over all diffusion steps for each of the 16 transformer layers in the UNet. For Safe Cross Attention, we additionally separate the attention maps into the components corresponding to c_pub and c_edit, allowing us to examine how each text source influences the model.

Figure S2 shows the attention maps for the Cross Attention and Safe Cross Attention modules at the 13th transformer layer (counting from 1). As illustrated, in vanilla InstructPix2Pix [7], attention is spread diffusely across the entire image, causing edits to leak into regions that should remain unchanged. In contrast, Safe Cross Attention produces a clear separation when we inspect the maps per token group: public-caption tokens attend primarily to stable background regions and task-relevant objects, while edit-instruction tokens concentrate their attention on the sensitive areas identified for anonymization. This structured attention pattern demonstrates that Safe Cross Attention provides a more controlled and interpretable mechanism for privacy-preserving editing, facilitating denoising that is both more selective and more faithful to the intended anonymization behavior.

S4.3. Does Our Method's Controllability Promote Fairness?

Throughout the aforementioned experiments, the LLM was free to propose any identity attributes.
To assess demographic controllability, we introduced a simple intervention: we constructed a list of racial groups, and for each image, we uniformly sampled one race and asked the LLM to integrate it into the edit ideas. The racial groups include White, Black, East Asian, South Asian, Southeast Asian, Middle Eastern/North African, Indigenous/Pacific Islander, and Hispanic/Latino. We use demographic predictions as a proxy for diversity.

Figure S2. Attention maps at the 13th transformer block in the UNet. From left to right: (1) the original private image; (2) the anonymized output produced by vanilla InstructPix2Pix along with its standard cross-attention map; (3) the anonymized output produced by our Safe Cross Attention model, shown with its vanilla cross-attention map, the Safe Attention map corresponding to public-caption tokens, and the Safe Attention map corresponding to edit-instruction tokens.

Figure S3. Example of demographic-controlled anonymization under the randomly sampled "Indigenous/Pacific Islander" condition. Target prompt: "Indigenous and Pacific Islander individuals gather around a large blue-and-white decorated cake featuring logos. Three women with braided hair and traditional garments, one man in a woven shirt, stand together. The outdoor setting now features a generic rural landscape. No tattoos, medical items, or text are visible."

Figure S3 presents an example under the sampled label Indigenous/Pacific Islander. After applying our pipeline, the anonymized output reconstructs the scene with entirely new identities whose appearance, such as skin tone, hairstyle, and traditional clothing, aligns with the sampled demographic category.
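The sampling intervention above can be sketched in a few lines; the prompt wording below is illustrative and not the exact Stage 1 template:

```python
import random

# Racial groups listed for the demographic-controllability intervention.
RACIAL_GROUPS = [
    "White", "Black", "East Asian", "South Asian", "Southeast Asian",
    "Middle Eastern/North African", "Indigenous/Pacific Islander",
    "Hispanic/Latino",
]


def sample_demographic_instruction(public_caption: str, rng: random.Random) -> str:
    """Uniformly sample one racial group per image and fold it into the
    request sent to the edit-instruction LLM (wording is illustrative)."""
    race = rng.choice(RACIAL_GROUPS)
    return (
        f"Caption: {public_caption}\n"
        f"Propose identity modifications so that every person is depicted as "
        f"{race}, while keeping layout, objects, and actions unchanged."
    )
```

Because each image draws its group independently and uniformly, the expected demographic distribution of the edited dataset is balanced across the listed groups.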
At the same time, the global scene geometry and activity are preserved: the individuals are still gathered around the same cake, positioned in the same layout, and engaged in the same collective action. The LLM-generated edit instruction also faithfully reflects the sampled demographic, producing a coherent description that guides the editor toward culturally consistent attributes while removing sensitive cues such as uniforms, text, or identifiable faces. This intervention demonstrates that our framework not only anonymizes identity but also provides fine-grained control over demographic attributes simply via the text condition when explicitly requested.

S4.4. Qualitative results

How Are Images Anonymized Differently Across Text Priors? Figure S4 shows the results when using different text priors with the FlowEdit backbone [24]. The result for c_edit is omitted because this text prior is not suitable for the model, which expects a description of the target scene rather than edit instructions.

We observe that existing face anonymization frameworks fail to properly address non-human privacy concerns and instead prioritize modifying the entire face or body, which can sometimes be missed. In contrast, our method applies editing to the whole image and preserves only the information necessary for downstream tasks.

It is important to note that the priors c_priv and c_class represent unprocessed information that is not fully derived from our Stage 1 and is typically available in common image datasets, such as human-annotated captions or class labels. These priors are not recommended in our pipeline and tend to either fail to protect sensitive attributes or discard valuable diversity in the images. In comparison, safe text prompts including c_pub, c_edit, and LLM are much more suitable for effective downstream learning.
This improvement is due to their alignment with the original content as well as their ability to preserve fine-grained details and structural dynamics.

Figure S4. Qualitative results of our method with different text priors and existing face anonymizers. Our results were generated by the FlowEdit [24] model.

Figure S5. Qualitative results of our method with different generative backbones and existing face anonymizers. FlowEdit [24] takes LLM as the target text prior, while the other models take c_edit as the text prior.

How Are Images Anonymized Differently Across Generative Backbones? We also show the quality of anonymization across different backbone diffusion models.

Figure S6. Multi-dimensional evaluation of privacy–utility trade-offs across anonymization models on the Caltech101 dataset [13]. Each subplot visualizes one privacy metric (FaceSim, TextSim, or RaceEntropy) against downstream utility (Acc@1). Colors denote different editing backbones, and marker shapes represent the text priors used during generation. Marker size is proportional to the SSIM cheating score, and a black outline indicates models fine-tuned on our anonymization dataset. This visualization highlights how fine-tuning and text conditioning jointly influence privacy protection, realism, and task performance.

Figure S5 shows outputs of all four diffusion models, each with the most compatible text prior as suggested by its authors: FreePrompt with c_edit, fine-tuned InstructPix2Pix with c_edit, FLUX-based OminiControl with c_edit, and FlowEdit with the target caption LLM. Regardless of the generator choice, our Unsafe2Safe generates images that align closely with the original content while revealing little leakage.

S5.
Visualized Trade-Off Among All Evaluation Dimensions

Figure S6 summarizes the quantitative evaluation across multiple dimensions. Each subplot uses a different privacy indicator on the x-axis (Face Similarity, Text Similarity, or Race Entropy) and downstream utility (top-1 accuracy) on the y-axis. Marker size reflects the cheating structural similarity (SSIM), while color, shape, and outline respectively encode the generative backbone, the text prior used during editing, and whether the model was fine-tuned.

This consolidated visualization makes the privacy–utility–fidelity trade-off directly interpretable. Face-only anonymization baselines perform well on face similarity and achieve slightly higher classification accuracy, but they fail to remove low-level cues, textual information, and demographic signals, resulting in substantially weaker overall privacy protection. In contrast, as revealed by the scatterplots, our editing models generally occupy favorable regions of the trade-off space: they reduce identifiable content while maintaining competitive utility, without relying on structural similarity.

S6. Prompts to Language Models

S6.1. Prompts used in Stage 1

Figures S7, S8, S9, S10, and S11 illustrate the prompting components that shape how Stage 1 decomposes privacy reasoning into a sequence of explicit and controllable text-generation steps. Figure S7 defines the privacy criteria used throughout Stage 1 by summarizing VISPR [34]'s 67 attributes into nine interpretable categories.

Figure S8 shows the prompt used to obtain the PRIVACY_FLAG. This prompt is intentionally conservative, defaulting to TRUE under any ambiguity, to minimize false negatives, since any missed detection would allow sensitive content to pass downstream. The expected output is intentionally short and "explanation-free," facilitating fast, large-scale screening across all images.
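The conservative, default-to-TRUE handling of the flagging output can be sketched as follows; the exact response parsing is our assumption, since the paper only specifies the expected one-line format:

```python
def parse_privacy_flag(vlm_response: str) -> bool:
    """Parse the VLM's PRIVACY_FLAG response conservatively.

    Returns True (unsafe) unless the model explicitly and unambiguously
    answers FALSE; malformed, empty, or contradictory output defaults to
    True so that no potentially sensitive image slips past the filter.
    """
    text = vlm_response.strip().upper()
    if "PRIVACY_FLAG: FALSE" in text:
        # If the reply somehow contains both flags, stay conservative.
        return "PRIVACY_FLAG: TRUE" in text
    return True
```

This mirrors the prompt's design: false negatives are the costly error, so every ambiguous reply is treated as a positive detection.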
Figure S9 presents the structured captioning prompt, which serves as the backbone of Stage 1. Although the PRIVACY_REVIEW is not directly used in later steps, it forces the VLM to explicitly enumerate privacy-relevant elements, thereby making omissions auditable and ensuring that the subsequent private and public captions are grounded in a clear semantic separation.

Figure S10 provides the prompt used by the LLM to generate edit instructions. It requires attribute-level rewriting (e.g., gender, hair, clothing, body shape, cultural markers) while prohibiting repetition of unchanged details. Importantly, the image itself is not provided to the LLM at this stage; the model must reason solely from the privacy-preserving caption.

Finally, Figure S11 shows the prompt used to merge the public caption and the edit instruction into a single compact description. Despite its simplicity, this step addresses two practical constraints: (i) diffusion editors often impose strict token budgets, and (ii) the merged caption enables seamless integration of available ground-truth labels for preserving task-relevant semantics during editing.

S6.2. Prompts used in evaluation

We report here the prompts used for all VLM-based evaluation components. These cover four tasks: (1) privacy flagging under a restricted set of criteria, (2) extraction of readable text from the image, (3) detection of demographic attributes, and (4) judgment of anonymization quality. The corresponding prompts are shown in Figure S12, Figure S13, Figure S14, and Figure S15, respectively.

S7. Deployment

We provide the implementation of a user-friendly Graphical User Interface (GUI) for our pipeline via Gradio [2]. The interface demonstrates how an uploaded image is automatically processed by our VLM- and LLM-based inspectors to produce safe captions and an anonymized proxy generated by the selected editing model.
T o promote accessibility and reproducibility , we plan to publicly release this interface to the research community . - ** People ** , faces (full or partial), visible skin in private contexts, tattoos, or emotional expressions - ** Appearance cues ** : clothing, accessories, or features revealing gender, culture, religion, status, or roles - ** Health indicators ** : casts, wheelchairs, bandages, medications, disabilities, or visible illness/injury - ** Texts or symbols ** : signs, receipts, documents, screens, handwriting, license plates, usernames, phone numbers - ** Locations or timestamps ** : homes, offices, bedrooms, addresses, classrooms, GPS data, clocks, date/time stamps - ** Personal interactions ** : small groups, families, friends, teams, protests, celebrations, or close relationships - ** Professional/school settings ** : uniforms, tools, name tags, diplomas, workspaces, formal events - ** Cultural/political/religious identifiers ** : flags, slogans, attire, ceremonies, posters - ** Private belongings or behaviors ** : hobbies, photos, mail, social media, travel documents, receipts Figure S7. Priv acy-related visual criteria used in the Stage 1 of our pipeline. These criteria are summarized from the set of 67 criteria introduced by VISPR [ 34 ]. SECTION: PRIVACY_FLAG Determine whether the image contains ** any content that could disclose private, sensitive, or personally identifying information ** . Do not include explanations, justification, or extra text. --- Set: ** PRIVACY_FLAG: TRUE ** if the image contains * even partially visible * , * blurred * , or * contextually implied * content from the list below. If uncertain, ambiguous, low-resolution, or partially occluded| ** default to TRUE ** . 
---
Flag the image if you observe or reasonably infer the presence of:
[CRITERIA]
---
**Treat these ambiguous cases as DEFINITE PRIVACY_FLAG: TRUE**:
- Partial faces in mirrors or reflections
- Blurred or occluded name tags, signs, or screens
- Visible groups of people even without faces
- Partially legible receipts or handwritten notes
- Niche personal objects (e.g. medals, family photos, memorabilia)

Figure S8. Assistant prompt used for the privacy filtering step in Stage 1. When a privacy issue is detected in an image, the VLM sets PRIVACY_FLAG to TRUE, even when the issue is only partially visible, contextually implied, ambiguous, or low-resolution.

Your task is to analyze the image and generate **structured, privacy-aware captions** in 3 sections:
- `PRIVACY_REVIEW`: Identify sensitive or private elements
- `PRIVATE_CAPTION`: Full factual visual description
- `PUBLIC_CAPTION`: Safe, anonymized caption suitable for generative training
---
SECTION: PRIVACY_REVIEW
List up to 10 elements in the image that pose a privacy risk. For each, describe what is visible and why it is sensitive. Use this format:
- ITEM: [Short factual phrase of the visible element]
- REASON: [Why it is privacy-sensitive, based on the list below]
Consider the following privacy risk categories:
[CRITERIA]
Example:
- ITEM: A woman hugging a child in a decorated living room
- REASON: Relationship, identity, and private setting
---
SECTION: PRIVATE_CAPTION
Provide a richly detailed narrative (<= 50 tokens) of everything visible in the image, even if privacy-sensitive. Include:
- People, animals, objects, or vehicles and their appearances
- Visible logos, brand names, or text
- The full environment and any identifying or contextual features
---
SECTION: PUBLIC_CAPTION
Write a **safe and fluent caption** (<= 50 tokens) that:
- Rephrases or removes all sensitive elements listed in PRIVACY_REVIEW.
- Does NOT include identity, brand names, text, or logos
- Focuses on scenery, layout, lighting, objects, and anonymous presence (e.g., "a person" is allowed if not identifiable)
- Uses fluent, image-grounded language. Do **not** censor awkwardly or produce incomplete fragments. The caption should stand alone.

Figure S9. User prompts provided to the VLM for generating structured, privacy-aware captions of private images. The prompt consists of three sections: PRIVACY_REVIEW, which identifies potentially sensitive elements; PRIVATE_CAPTION, which offers a detailed caption of the entire scene; and PUBLIC_CAPTION, which produces a concise, privacy-compliant caption suitable for use as guidance for diffusion models.

You are given a caption that describes a real-world photo:
Caption: public_caption
Your task is to propose realistic identity modifications while the content of the scene (layout, objects, actions) must remain exactly the same. These descriptions will be applied as editing instructions for a Stable Diffusion model.
- Make the modifications significant, specific, and easy for a Stable Diffusion model to generate.
- Change every person and every object with privacy risks.
- For each person, include at least 3 specific changes regarding gender, hair, body shape, skin tone, clothing, or culture.
- Neutralize objects or background cues with privacy risks with plain or generic alternatives.
The changes may include:
[CRITERIA]
Rules:
- Preserve the current object appearance, layout, and actions.
- Output only the modifications, not a rewrite of the caption.
- Do NOT repeat the unchanged details from the original caption.
- Keep it realistic, specific, and <= 45 words as a continuous, natural phrase fragment.

Figure S10. Prompt to the LLM for generating edit ideas as instructions conditioned on the public caption. The model proposes identity-neutral but realistic attribute modifications, with a focus on faces, while preserving layout and scene semantics.
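These templates are filled at run time by substituting the [CRITERIA] list and the public caption into placeholder slots. A minimal sketch of that substitution, with an abridged, hypothetical template (the real template text is the full prompt shown above):

```python
# Abridged, illustrative version of the edit-instruction template;
# {public_caption} and {criteria} stand in for the placeholder slots.
EDIT_PROMPT_TEMPLATE = """You are given a caption that describes a real-world photo:
Caption: {public_caption}
Your task is to propose realistic identity modifications ...
The changes may include:
{criteria}
"""


def build_edit_prompt(public_caption: str, criteria: list[str]) -> str:
    """Fill the edit-instruction template with the Stage 1 public caption
    and the bulleted privacy criteria summarized in Figure S7."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return EDIT_PROMPT_TEMPLATE.format(public_caption=public_caption,
                                       criteria=bullets)
```

Keeping the criteria as data rather than hard-coded prose is what makes the restricted-criteria ablation of Figure S12 a one-line change.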
You are given a caption describing a real-world photo and an edit instruction.
Original caption: public_caption
Edit instruction: edit_caption
Combine them into a single, natural-sounding caption that reflects ALL edits while preserving the unchanged parts. Keep the result concise (under 50 words).
After combining, the caption has to describe an image of class_name.

Figure S11. Prompt for merging the public caption and edit instruction into a single compact description compatible with token-limited diffusion editors. When a class name or ground-truth annotation is available, users can simply include it in the prompt to obtain a task-relevant caption.

SECTION: PRIVACY_FLAG
Determine whether the image contains any instance of **criteria**.
Respond with only one line: **PRIVACY_FLAG: TRUE** or **PRIVACY_FLAG: FALSE**
Do not include explanations or additional text.
---
**Set PRIVACY_FLAG: TRUE** if the image clearly, partially, or contextually depicts any CRITERIA_NAME. If the presence of criteria is uncertain, ambiguous, low-resolution, or partially occluded, **default to TRUE**.
**Set PRIVACY_FLAG: FALSE** if the image does not contain criteria, or if any privacy concern is unrelated to criteria.
---
Flag the image as TRUE if you can observe or reasonably infer any of the following:
CRITERIA_DESCRIPTION

Figure S12. Prompt used to ablate the flagging step with custom, less-inclusive privacy criteria.

You are an image analysis assistant. Analyze the provided image and extract all visible text that appears anywhere in the scene (including logos, signs, menus, or labels).
If no readable text is visible, output exactly: NO_TEXT
Otherwise, respond strictly in the following format:
### TEXT
Rules:
- Include only the text visible in the image - no descriptions or interpretations.
- Preserve exact wording, capitalization, and line breaks.
- If part of the text is unclear, replace it with [unclear].
- Output exactly one block following the format above.
- If no text is visible, output only "NO_TEXT".

Figure S13. User prompt used for detecting text in the image.

You are an image analysis assistant. Analyze the provided image and identify all apparent genders and races visible in the scene.
If no clearly visible person (no recognizable face or body) is present, output exactly: NO_HUMAN
Otherwise, respond strictly in the following format:
### GENDER
### RACE
Rules:
- Use only the categories provided.
- Include all that clearly appear, separated by commas.
- Do NOT include "unknown" or any explanation.
- Output exactly one block following the format above.
- If no human is visible, output only "NO_HUMAN".

Figure S14. User prompt used for detecting demographic attributes (gender and race) in the image.

System Prompt:
You are a vision-language evaluator for privacy-preserving image anonymization. You must compare the left and right images and judge anonymization strictly based on visual differences. Assign a score from 0 to 100 based on how much that difference reduces the ability to identify or link specific people, objects, or attributes from the LEFT image to the RIGHT image. Do not infer or assume sensitive information beyond what is visually identifiable.

User Prompt:
You are given two images shown side by side:
- Left image: the original image
- Right image: the edited / anonymized image
Your task is to assign an anonymization score that measures how much the RIGHT image reduces the ability to link the LEFT image to the same specific identities or personal specifics.
Privacy-sensitive information includes (but is not limited to):
[CRITERIA]
You MUST follow this decision process:
STEP 1 | Instance identity check: Decide whether the LEFT and RIGHT images depict the same characteristics of privacy-sensitive information.
If they appear to be the same instance with no meaningful visual change, the score will be 0.
- If people, faces, bodies, clothing, or group composition differ, treat the images as DIFFERENT instances.
- The same layout, camera angle, or furniture arrangement does NOT imply the same instance.
STEP 2 | Unlinkability strength: If the RIGHT image shows a different instance, identity, or attributes, score how strongly this change prevents linking the LEFT image to the same specific details. The general scene, activity, or layout may remain similar.
Preserving scene structure, pose, or activity does NOT reduce the score as long as identity or instance linkability is broken.
---
### Scoring (0-100)
- **0**: Same identities or same instance (no anonymization).
- **20-40**: Weak unlinkability (small but real instance or attribute change).
- **40-70**: Moderate unlinkability (multiple identity-related cues changed).
- **70-90**: Strong unlinkability (identities or specifics are hard to link).
- **90-100**: Near-complete unlinkability of identities or personal specifics.
High scores are allowed even when scene/layout/activity is preserved, if identities or personal specifics are replaced or neutralized.
Reward evidence of:
- people/face identity replacement or strong masking (blur/occlusion/distortion)
- removal/replacement of tattoos, distinctive clothing/accessories, name tags
- removal/replacement of readable text (documents, screens, plates, usernames)
- replacement of distinctive personal objects or environment instance details
---
Rules:
- Give score 0 for an identical pair.
- Give score 100 for a totally different image pair.
- Do NOT penalize preservation of scene structure or activity.
- Ignore image quality, realism, or aesthetics.
- Do not infer sensitive information that is not clearly visible.
Output:
ANONYMIZATION_SCORE:
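Downstream of this judge prompt, the numeric score must be recovered from free-form model output. A minimal, hedged parser (the exact response handling is our assumption; the prompt only fixes the `ANONYMIZATION_SCORE:` line format):

```python
import re
from typing import Optional


def parse_anonymization_score(judge_output: str) -> Optional[int]:
    """Extract the 0-100 ANONYMIZATION_SCORE from the judge model's reply.

    Returns None when no well-formed, in-range score line is found, so the
    caller can retry or discard the sample instead of trusting a garbled
    judgment."""
    match = re.search(r"ANONYMIZATION_SCORE:\s*(\d{1,3})", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None
```

Rejecting out-of-range values keeps a single hallucinated number from skewing the aggregated VLMScore.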