Towards Context-Aware Image Anonymization with Multi-Agent Reasoning

Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage, Michael Heigl, Martin Schramm
Deggendorf Institute of Technology, Deggendorf, Germany
robert.aufschlaeger@th-deg.de

Abstract

Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework, CA-IAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning), for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection: a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on Cityscapes, we achieve KID: 0.001 and FID: 9.1, significantly outperforming existing anonymization methods. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved.
Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting the EU GDPR's transparency requirements while flagging failed cases for human review.

A. Introduction

Street-level imagery has become ubiquitous through mapping services, autonomous vehicle datasets, and urban monitoring systems. While enabling applications in navigation, urban planning, and computer vision research, these images inherently capture personally identifiable information (PII) at scale. Beyond obvious identifiers like pedestrian faces and license plates, recent work demonstrates that indirect identifiers, including clothing, accessories, and contextual objects, enable Re-ID and group membership inference even when direct PII is obscured [7, 37].

Figure 1. Two-phase agentic anonymization architecture. Phase 1 employs specialized models for direct PII (full body, license plates). Phase 2 implements multi-agent orchestration with round-robin coordination, where specialized agents handle classification (Auditor), synthesis (Generative), and workflow management (Orchestrator), implementing PDCA cycles.

The challenge is compounded by advances in large vision-language models (LVLMs). State-of-the-art models can infer private attributes from contextual cues with up to 76.4% accuracy [37], while cross-modal person re-identification (Re-ID) systems achieve robust performance across challenging cross-dataset and zero-shot scenarios [1, 20]. Recent foundation-model-based multimodal Re-ID frameworks can handle uncertain input modalities, achieving competitive zero-shot performance without fine-tuning [20]. Existing solutions fail to address the full scope of this challenge.
Monolithic detection pipelines apply uniform processing regardless of context, missing the distinction between public signage and private property; black-box approaches provide no audit trail for regulatory compliance; and, most critically, current methods lack adaptive strategies for the semantic diversity of indirect identifiers. We propose that effective anonymization requires not just better models but a fundamentally different architecture: one that reasons about context, adapts to diverse PII types, and provides transparent decision-making. To this end, we introduce a multi-agent framework that orchestrates specialized components through structured communication and iterative refinement. Our framework addresses the research question: "Can orchestrated agents achieve context-aware image anonymization while preserving data utility and enabling accountability?" The system (Figure 1) combines pre-defined preprocessing for high-confidence direct identifiers with agentic reasoning for context-dependent cases, implementing bounded iterative refinement through Plan-Do-Check-Act (PDCA) cycles (Shewhart/Deming cycles) for multi-agent coordination. Our key contributions are:

1. Context-aware agentic anonymization with PDCA refinement. We present an image anonymization system combining pre-defined preprocessing for direct PII (full bodies, license plates) with multi-agent orchestration for context-dependent identifiers requiring semantic reasoning. The framework implements spatially-filtered coarse-to-fine detection via a scout-and-zoom strategy, 30% IoU overlap filtering for deduplication, and bounded PDCA cycles with round-robin speaker selection to enable iterative refinement without infinite loops.

2. On-premise processing with transparent audit trails. Operating on-premise with open-source models (with the option to use APIs) ensures full data sovereignty and verifiable control over processing pipelines.
This setup supports GDPR's data minimization principle (Art. 5) by eliminating unnecessary data transfers to external or third-party services. The anonymization policy is adaptable through configurable prompts that guide the auditor agent's PII classification according to domain-specific criteria. The multi-agent architecture generates structured, human-interpretable audit trails documenting detection decisions and reasoning chains, supporting transparency (GDPR Art. 13-15), the right to explanation, and meaningful human oversight requirements (GDPR Art. 22, Recital 71). Audit trail completeness ultimately depends on LVLM output correctness; the framework flags failed or uncertain cases for human review.

3. Controlled appearance decorrelation for privacy-utility optimization. We leverage Stable Diffusion XL (SDXL) [30] with ControlNet [45] conditioning for both full bodies and objects to eliminate Re-ID vectors while preserving utility-relevant structural information. This modal-specific diffusion guidance realizes appearance decorrelation that breaks identity-linking features without compromising geometry and context. The diffusion model is used as a replaceable inpainting component without training or fine-tuning, mitigating training data leakage concerns.

B. Related Work

Image anonymization has evolved to address privacy regulations and the proliferation of street-level imagery, yet gaps remain in coverage of indirect identifiers, adaptive detection, validation, and regulatory compliance.

B.1. Obfuscation and Generative Methods

Conventional image anonymization approaches, such as EgoBlur [31], rely on blurring for efficiency. However, these methods insufficiently protect identity information while degrading downstream task performance. For instance, a 5.3% drop in instance segmentation average precision (AP) on Cityscapes [6] has been reported when applying Gaussian blur [15].
Moreover, blurring remains vulnerable to inversion attacks: 95.9% identity retrieval accuracy has been achieved on blurred CelebA-HQ [16] face images at 256×256 resolution [44], demonstrating that simple obfuscation is inadequate for modern threat models. Generative methods address these limitations by synthesizing realistic replacements for sensitive regions, preserving contextual consistency while reducing Re-ID risks. Early deep learning-based methods predominantly use Generative Adversarial Networks (GANs). DeepPrivacy [14] employs a conditional GAN that generates anonymized faces conditioned on pose keypoints and background context. DeepPrivacy2 [13] extends this approach to full-body anonymization via a style-based GAN, improving realism and consistency across human appearance and pose. Recent advances leverage diffusion models, offering higher fidelity, controllability, and semantic alignment. LDFA [17] applies latent diffusion for face anonymization in street recordings, maintaining both realism and privacy. The FADM framework [48] and RAD [26] employ diffusion to anonymize persons. SVIA [22] extends diffusion-based anonymization to large-scale street-view imagery, ensuring privacy preservation in autonomous driving datasets by replacing the semantic categories of person, vehicle, traffic sign, road, and building. Reverse Personalization [18] introduces a finetuning-free face anonymization framework based on conditional diffusion inversion with null-identity embeddings. By applying negative classifier-free guidance, it steers the generation process away from identity-defining features while enabling attribute-controllable anonymization for arbitrary subjects.
Despite these advances, current generative approaches share key limitations: they focus almost exclusively on human subjects or are restricted to a fixed set of categories, employ single-pass detection and replacement, and lack iterative validation or audit mechanisms.

B.2. Privacy Risks from Large Vision-Language Models and Agentic AI

Frontier LVLMs infer private attributes at 77.6% accuracy from contextual cues [37], with easily circumvented safety filters (refusal rates drop to 0%). Person Re-ID systems exploit not just faces but clothing, gait, and items [1]; pose-based methods perform even under clothing changes [24], and cross-modal approaches enable inference under appearance variation [20]. Agentic multimodal reasoning models escalate this threat [25]: OpenAI o3 achieves 99% metropolitan-level (city-region) geolocation accuracy from casual photos by autonomously chaining web search and visual analysis across subtle environmental cues such as street layouts, architectural styles, and infrastructure that individually appear harmless, with median localization errors of 5.46 km (Top-1) and 2.73 km (Top-3). When augmented with external tools (web access, image zooming), o3's median error drops to 1.63 km. This establishes that effective anonymization must address contextual elements enabling privacy inference, not just direct PII. Beyond visual modalities, recent work demonstrates that LLMs fundamentally alter the economics of text-based deanonymization. Lermen et al. [19] propose the ESRC framework (Extract, Search, Reason, Calibrate) and show that LLM agents can perform large-scale deanonymization of pseudonymous online accounts, achieving up to 68% recall at 90% precision when linking Hacker News profiles to LinkedIn identities.

B.3. Agentic Systems for Computer Vision

Recent work has explored the autonomous construction of specialized vision systems through agentic frameworks.
ComfyBench [43] evaluates LLM-based agents that design collaborative AI systems through specialized roles: Planning, Combining, Adaptation, Retrieval, and Refinement. ComfyMind [9] extends this foundation through hierarchical search planning with local feedback loops, substantially improving task resolution for complex multi-stage workflows. RedactOR [36] demonstrates agent-based de-identification of clinical text and audio with audit trail generation and privacy compliance. In summary, however, existing approaches do not combine multi-agent orchestration with visual anonymization, semantic reasoning about context-dependent identifiers, and bounded iterative validation for comprehensive coverage.

C. Method

C.1. System Architecture

Our framework employs a two-phase architecture orchestrating pre-defined and agentic components. Phase 1 applies high-confidence detection models for direct PII (persons, license plates) and traffic signs. Phase 2 implements multi-agent reasoning for context-dependent identifiers through structured agent communication. The multi-agent system utilizes AutoGen [41] with round-robin speaker selection, implementing the control flow: Auditor → Orchestrator → Generative → Auditor. This rotation ensures predictable execution while enabling iterative refinement through PDCA cycles.

C.2. Phase 1: Pre-defined Preprocessing

Person Detection and Anonymization. We employ YOLOv8m-seg (see footnote 1) for instance segmentation (confidence threshold τ = 0.25).
The anonymization pipeline operates in three stages: (1) extract visual descriptions via Qwen2.5-VL-32B [2], capturing pose, body build, view angle, and actions; (2) generate anonymized prompts by instructing the LVLM to randomly select clothing colors and brightness modifiers from predefined palettes (20 colors, 10 brightness levels; see footnote 2) rather than describing observed clothing, while preserving utility-relevant attributes (body build, pose, view angle, actions) and uniformed-personnel context; (3) perform SDXL inpainting at 768 px resolution with OpenPose [4] ControlNet (see footnote 3), using conditioning scale α = 0.8, strength ρ = 0.9, guidance scale s = 9.0, and 25 denoising steps. The positive and negative prompts (excluding the LVLM-generated description) are adapted from [48]. Critically, we disable all color matching to prevent appearance correlation and eliminate Re-ID vectors.

License Plate Processing. Small-object detection challenges necessitate specialized handling. We train YOLOv8s on the UC3M-LP dataset [32] at 1280 px resolution, achieving mAP50-95 = 0.82. At inference, we apply a reduced confidence threshold (τ = 0.05) prioritizing recall, and Non-Maximum Suppression (NMS) with an IoU threshold of 0.5 to ensure individual plate separation, aligning with conventional NMS practices [38]. Gaussian blur (radius r = 8 px, kernel 15×15) obscures alphanumeric content while preserving vehicle context.

Traffic Sign Exclusion. YOLO-TS [5] detects traffic signs at 1024 px (τ = 0.2, NMS IoU = 0.45), generating rectangular exclusion masks that prevent false-positive PII detection in Phase 2. Signs contain public information and require no anonymization.

C.3. Phase 2: Multi-Agent Orchestration

Three specialized agents coordinate through an AutoGen group chat with strict round-robin speaker selection. The Orchestrator Agent manages workflow state and enforces bounded iteration (n_max = 3).
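The round-robin rotation with bounded iteration can be sketched as a plain-Python control loop. This is a hypothetical minimal sketch (function and variable names are ours, for illustration); the actual system wires the agents through an AutoGen group chat rather than a hand-rolled loop.

```python
from itertools import cycle

# Fixed rotation order from Sec. C.1/C.3: Auditor -> Orchestrator -> Generative.
ROTATION = ["Auditor", "Orchestrator", "Generative"]
N_MAX = 3  # bounded iteration limit enforced by the Orchestrator


def run_round_robin(step_fn, n_max=N_MAX):
    """Drive agents in strict round-robin order for at most n_max cycles.

    step_fn(agent, cycle_idx) performs one agent turn and returns True if
    residual PII remains after that turn; one full rotation (ending with
    the Generative agent) counts as one PDCA cycle.
    """
    speakers = cycle(ROTATION)
    for n in range(1, n_max + 1):
        residual = False
        for _ in ROTATION:          # one full rotation = one cycle
            agent = next(speakers)
            residual = step_fn(agent, n)
        if not residual:            # Act: no residual PII -> terminate early
            return n, "COMPLETE"
    return n_max, "LIMIT_REACHED"   # guaranteed termination at n_max
```

A step function that reports residual PII only during the first cycle terminates after the second, so the loop never spins indefinitely.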
The Auditor Agent performs PII classification via Qwen2.5-VL-32B and conducts independent quality validation of anonymized outputs through visual residual detection. The Generative Agent executes scout-and-zoom segmentation, a two-stage approach motivated by the region-proposal paradigm introduced in Faster R-CNN [34], implements IoU-based deduplication (30% threshold) to prevent redundant reprocessing of overlapping instances, and performs modality-specific inpainting using SDXL with Canny ControlNet for object removal. Complete agent system prompts, tool specifications, PII classification prompts, and inpainting templates are provided in the Supplementary. Table 1 details agent responsibilities, tool access, and execution protocols.

Tool-call validation prevents workflow stalls from hallucinated completion claims: agents must invoke functions rather than merely acknowledging instructions. The Orchestrator's conversation-history analysis ensures tool execution results are verified before marking steps complete.

Footnotes:
1. https://github.com/ultralytics/ultralytics, accessed October 25, 2025.
2. Color palette: gray, beige, navy, white, black, brown, khaki, blue, red, green, yellow, pink, teal, burgundy, olive, charcoal, maroon, tan, cream, sage. Brightness modifiers: light, dark, bright, faded, vibrant, muted, pale, deep, pastel, bold.
3. thibaud/controlnet-openpose-sdxl-1.0

C.4. Phase 2: Bounded Iterative Refinement

The framework implements bounded PDCA cycles (Figure 2), ensuring coverage while guaranteeing termination.
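The GenerativeAgent's IoU-based deduplication (30% threshold, Sec. C.3) can be sketched as follows. Bounding boxes use the [x_min, y_min, w, h] format described elsewhere in the paper; function names are ours, for illustration only.

```python
IOU_THRESHOLD = 0.3  # default of the framework's configurable threshold


def iou(a, b):
    """Intersection-over-Union of two boxes in [x_min, y_min, w, h] format."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def skip(candidate, processed, tau=IOU_THRESHOLD):
    """Skip rule: skip(bbox_i) is true iff max_j IoU(bbox_i, bbox_j) >= tau,
    i.e., the candidate overlaps an already-processed instance."""
    return any(iou(candidate, p) >= tau for p in processed)
```

A candidate whose overlap with any processed box reaches the threshold is skipped, preventing redundant inpainting when the Auditor reports the same region across iterations.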
Each cycle executes: (1) Plan: the Orchestrator determines which instances require processing; (2) Do: the Generative agent performs IoU deduplication, then executes scout-and-zoom segmentation and inpainting; (3) Check: the Generative agent validates bbox overlap with processed instances (efficiency), then the Auditor independently validates visual completeness on the masked image (quality); (4) Act: the Orchestrator decides whether to continue (n < n_max AND residuals exist) or terminate. The dual-layer validation mechanism balances efficiency and quality: GenerativeAgent's IoU deduplication prevents redundant computation when AuditorAgent detects the same instance across iterations (skip(bbox_i) = true if max_j IoU(bbox_i, bbox_j) ≥ τ with τ = 0.3, and false otherwise), while AuditorAgent's independent visual inspection ensures anonymization quality.

D. Experiments

D.1. Experimental Setup

Hardware. All experiments were conducted on a virtual machine with a 12-core AMD EPYC 9534 processor, 64 GB RAM, dual NVIDIA L40S GPUs (48 GB VRAM each), and CUDA 12.9.

Models. We use a dual-GPU setup: GPU 0 runs YOLOv8 for detection/segmentation, Grounded-SAM-2 [23, 33] (https://github.com/IDEA-Research/Grounded-SAM-2, accessed September 8, 2025) for precise segmentation, SDXL [30] with ControlNet [45] for inpainting, and Qwen-2.5-32B via AutoGen for orchestration (round-robin, 300 s timeout). GPU 1 runs Qwen2.5-VL-32B [2] via Ollama for LVLM-based PII classification.
Figure 2. Round-robin PDCA coordination with three specialized agents. Phase 1 applies deterministic preprocessing for direct PII: detect/anonymize persons (YOLOv8-seg), detect/blur license plates (YOLOv8), mask traffic signs (YOLO-TS). Phase 2 implements bounded iterative refinement (n_max = 3) with round-robin order Auditor → Orchestrator → Generative → Auditor.

Detection uses original resolution. For Re-ID evaluation, we train ResNet50 [10] with triplet [35] and center loss [40] for 120 epochs using SGD (lr = 0.05, momentum = 0.9, weight decay = 5e-4) with warm-up and step decay.

Benchmarks. We evaluate on diverse benchmarks covering three key privacy-utility aspects:
• Image Quality Preservation: Cityscapes test set [6] (1,525 urban street-scene images recorded in Europe) and CUHK03-NP, to quantify visual degradation from anonymization.
• Person Re-ID Risk: CUHK03-NP detected [47] (767 training identities with 7,365 images; 700 test identities with 1,400 query and 5,332 gallery images).
• PII Detection Quality: Visual Redactions Dataset [28, 29] test2017 split (2,989 street-view images annotated with textual, visual, and multimodal attributes across 24 categories).

Baselines. We compare against representative anonymization approaches: Gauss. Blur (σ = 20, kernel = 51, traditional pixel obfuscation), DeepPrivacy2 (DP2) [13] (GAN-based face synthesis with multimodal truncation [27]), FADM [48] (diffusion-based full-body anonymization), SVIA [22] (diffusion-based street-recording anonymization), and per-category PII segmentation from the Visual Redactions Dataset [28, 29] with category predictions aggregated into unified masks.
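The Gauss. Blur baseline corresponds to a standard Gaussian convolution (σ = 20, kernel = 51). A minimal separable-blur sketch in plain Python, assuming edge replication at the borders (function names are ours; real pipelines would use an optimized image library):

```python
import math


def gaussian_kernel_1d(sigma=20.0, size=51):
    """Normalized 1-D Gaussian kernel (sigma = 20, size = 51 matches the
    Gauss. Blur baseline parameters)."""
    half = size // 2
    k = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-half, half + 1)]
    s = sum(k)
    return [v / s for v in k]


def blur_1d(signal, kernel):
    """Convolve a 1-D signal with the kernel, replicating edge values.

    A full 2-D Gaussian blur applies this separably: first along each row,
    then along each column of the image.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), len(signal) - 1)  # clamp at edges
            acc += w * signal[idx]
        out.append(acc)
    return out
```

Because the kernel is normalized, a constant signal is left unchanged, which is a quick sanity check for any blur implementation.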
D.1.1. Image Quality Preservation

We evaluate image quality on the CUHK03-NP detected test split using five metrics: MSE (pixel-level error), SSIM [39] (structural similarity via luminance, contrast, and structure; range 0-1), LPIPS [46] (perceptual similarity via CNN features), KID [3] (unbiased distribution similarity via Maximum Mean Discrepancy), and FID [12] (Gaussian-assumed distribution distance in Inception space).

Table 1. Agent roles, tool access, and execution protocol in the Phase 2 multi-agent workflow. GenerativeAgent implements IoU-based deduplication (30% threshold) to prevent redundant reprocessing, while AuditorAgent provides independent quality validation.

Orchestrator. Role: workflow coordinator and state tracker. Tools: none (coordination-only, no direct tool access). Protocol: tracks completion status (classify_pii ✓, inpaint ✓, audit ✓, log ✓) via conversation-history analysis; enforces the validation rule "ONLY report success (✓) if the tool result is visible in the conversation history"; coordinates retry logic for failed audits; enforces the n_max = 3 iteration limit; issues text-only coordination messages to other agents.

Auditor. Role: PII classification, quality validation, logging. Tools: classify_pii, audit_output, log_output. Protocol: (1) PII classification: Qwen2.5-VL-32B with structured prompting returns bboxes in [x_min, y_min, w, h] format; the system prompt prohibits acknowledgment without execution ("When instructed to use a tool, YOU MUST CALL IT – NEVER just acknowledge"); scout-and-zoom verification is performed for each classified instance: crop to the LVLM bbox, run Grounded-SAM-2 [23, 33], map the mask back to full-image coordinates. (2) Independent quality validation: creates a masked image (processed areas blacked out after 5-iteration morphological dilation with a 5×5 elliptical kernel) and performs LVLM-based residual detection to verify that no visible PII remains.
This check is independent of GenerativeAgent's IoU deduplication, providing dual-layer verification.

Generative. Role: anonymization execution via modal-specific inpainting with self-validation. Tools: anonymize_and_inpaint. Protocol: (1) IoU-based deduplication: before processing, computes IoU(bbox_i, bbox_j) for each candidate instance against all processed instances; instances with IoU ≥ 0.3 are skipped with the status iou_overlap_with_processed, preventing redundant work when AuditorAgent detects the same region repeatedly (e.g., IoU = 0.88 in berlin_000002, iteration 2); the threshold is configurable via PII_IOU_THRESHOLD (default 0.3). (2) Scout-and-zoom segmentation: crops to the LVLM bbox, runs Grounded-SAM-2 on the crop, and maps the precise mask back to full-image coordinates; failed segmentations are skipped with the status scout_zoom_failed_no_fallback. (3) Modal-specific inpainting: SDXL with Canny ControlNet (α = 0.3, strength = 0.9, guidance = 9.0, 25 denoising steps) for objects/text, using the LVLM description as prompt; person inpainting (SDXL with OpenPose ControlNet, Phase 1) is handled deterministically; all color matching is disabled (luminance = 0.0, chrominance = 0.0) to prevent appearance correlation. (4) Format enforcement: handles three LLM serialization errors via robust parsing: single strings with comma-separated dicts ("{a:1},{b:2}"), stringified dict objects (["{a:1}"]), and nested list structures; format examples enforce JSON dict objects.

Round-Robin Execution Protocol. After any agent makes a tool call, control transfers to the UserProxyAgent for execution, then continues to the next agent: Auditor → Orchestrator → Generative → Auditor. Empty-message detection (2+ empty responses in the last 3 messages) triggers a skip to the next agent. For GenerativeAgent, an automatic retry reissues anonymize_and_inpaint with the pending instances.

Termination. Dual conditions. Primary: OrchestratorAgent emits "PIPELINE COMPLETE" (case-insensitive, punctuation-tolerant).
Fallback: audit.ok = true AND ctx.logged = true. This prevents infinite loops if the orchestrator fails to signal completion.

Lower values indicate better preservation, except for SSIM (higher is better). Figure 3 illustrates qualitative differences across anonymization approaches on a CUHK03-NP example (more examples are shown in the supplementary).

Results on CUHK03-NP. Table 2 presents the comparison. Our method obtains balanced performance across all metrics, outperforming baselines in distribution preservation while maintaining competitive perceptual quality. Compared to FADM, our full-body anonymization approach in Phase 1 achieves a stronger visual transformation (higher MSE: 975.6 vs. 446.4) while maintaining comparable structural similarity (SSIM: 0.699 vs. 0.785). Crucially, we demonstrate superior distribution preservation with reduced KID (0.014 vs. 0.032) and FID (20.4 vs. 33.3), indicating better preservation of statistical properties. Compared to aggressive methods like DP2, we reduce perceptual artifacts (LPIPS: 0.203 vs. 0.303) and improve structural preservation (SSIM: 0.699 vs. 0.443) while achieving substantially better distribution alignment (lower KID and FID) at a modest privacy trade-off (see the Re-ID results below). Gauss. Blur exhibits severe degradation despite similar pixel-level changes (MSE: 1097.9): high perceptual distortion (LPIPS: 0.382) and catastrophic distribution distortion (KID and FID worse than ours), confirming that simple obfuscation fails to preserve semantic content.

Figure 3. Qualitative comparison of anonymization methods (Original, Gauss. Blur, DP2, FADM, Ours) on a CUHK03-NP test example (bounding_box_test/0069_c2_658.png).

Results on Cityscapes. Table 3 demonstrates our method's effectiveness on street-scene images (Phases 1 and 2; see footnote 5). We produce dramatically superior image quality compared to SVIA across all metrics.
For fair comparison with SVIA's 2× downscaling evaluation protocol, we downscale both original and anonymized images to 1024×512 resolution for metric computation, though our anonymization operates at full resolution. These substantial gains stem from our granular transformations, which leave the vast majority of scene content unchanged, while SVIA's holistic anonymization modifies broader image regions (buildings, road infrastructure, and environmental context).

Privacy-Utility Trade-off. These metrics reveal complementary privacy and utility aspects. Higher MSE and LPIPS values indicate stronger privacy protection through greater visual differences from the originals, making matching more difficult. Lower SSIM reduces structural correlation, further hindering identity recognition. However, lower KID and FID values are beneficial because they indicate that anonymized images preserve the statistical properties needed for downstream applications, even though individual identities are protected. Our method achieves an optimal balance: sufficient visual transformation to prevent Re-ID (Section D.1.2) while preserving scene structure and distribution properties for continued data utility.

Table 2. Image quality metrics on CUHK03-NP detected test splits (6,732 images: 1,400 query + 5,332 gallery).

Method        MSE ↓    SSIM ↑   LPIPS ↓   KID ↓    FID ↓
Gauss. Blur   1097.9   0.633    0.382     0.224    178.5
DP2 [13]      3275.1   0.443    0.303     0.066    59.7
FADM [48]     446.4    0.785    0.157     0.032    33.3
Ours          975.6    0.699    0.203     0.014    20.4

Table 3. Image quality metrics on the Cityscapes test set (1,525 images).

Method      LPIPS ↓   KID ↓    FID ↓
SVIA [22]   0.530     0.027    44.3
Ours        0.025     0.001    9.1

D.1.2. Re-ID Risk Assessment

Evaluation Protocol.
We adopt a realistic threat model simulating adversarial Re-ID attempts: (1) train a ResNet50 [10] Re-ID model with triplet loss [35] and center loss [40] on the original CUHK03-NP detected training set (767 identities, 7,365 images) for 120 epochs; (2) test using original query images (1,400 queries) against anonymized gallery images (5,332 gallery) with re-ranking [47]. This simulates an attacker with access to original surveillance footage attempting to re-identify individuals by matching against anonymized public releases. High matching accuracy indicates privacy risk; low accuracy confirms effective anonymization. This threat model targets automated Re-ID at scale, the primary vector for mass surveillance. Acquaintances may still recognize subjects through retained structural features (pose, build), which utility-preserving methods must retain; person Re-ID threat models primarily address automated systems, not determined observers with prior knowledge. We employ standard Re-ID metrics: Rank-1 (R1) accuracy measures the percentage of queries for which the correct identity appears as the top-ranked gallery match, representing the success rate of immediate Re-ID; Mean Average Precision (mAP) evaluates ranking quality across all gallery positions, capturing overall retrieval performance.

Footnote 5: SVIA implementation and Cityscapes evaluations from https://github.com/Viola-Siemens/General-Image-Anonymization, accessed October 10, 2025.

Results. Table 4 demonstrates that our method achieves a substantial Re-ID risk reduction: only 16.9% R1 accuracy and 13.7% mAP, representing 72.9% and 79.2% reductions from original performance (62.4% R1, 66.0% mAP), respectively. Compared to FADM [48], we reduce person Re-ID risk (R1: 16.9% vs. 33.4%; mAP: 13.7% vs. 32.9%) while improving distribution preservation (Table 2: 56% better KID, 39% better FID).
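The Rank-1 and mAP metrics above can be computed from a query-by-gallery distance matrix. A minimal sketch (the function name is ours; the actual protocol additionally applies re-ranking and standard same-camera filtering, which this illustration omits):

```python
def rank_k_and_map(dist, query_ids, gallery_ids, k=1):
    """Compute Rank-k accuracy and mean Average Precision.

    dist: query-by-gallery distance matrix (smaller = more similar);
    query_ids / gallery_ids: identity labels for each row / column.
    """
    hits, aps = 0, []
    for qi, row in enumerate(dist):
        # Sort gallery indices by ascending distance to this query.
        order = sorted(range(len(row)), key=lambda j: row[j])
        matches = [gallery_ids[j] == query_ids[qi] for j in order]
        if any(matches[:k]):            # correct identity in top-k?
            hits += 1
        # Average precision over all gallery positions.
        correct, precisions = 0, []
        for rank, m in enumerate(matches, start=1):
            if m:
                correct += 1
                precisions.append(correct / rank)
        aps.append(sum(precisions) / max(correct, 1))
    return hits / len(dist), sum(aps) / len(aps)
```

On a 2×2 toy matrix where each query's true match is nearest, both Rank-1 and mAP are 1.0; pushing the true match to the second position drops Rank-1 to 0 and AP to 0.5 for that query.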
Aggressive obfuscation methods (DP2: 8.6% R1, 4.4% mAP; Gauss. Blur: 9.4% R1, 6.4% mAP) achieve lower Re-ID rates but at severe quality cost. DP2 exhibits worse perceptual distortion (LPIPS: 0.303 vs. 0.203) and worse distribution distortion (KID: 0.066 vs. 0.014). Gauss. Blur shows catastrophic degradation with the worst distribution alignment (KID: 0.224).

Privacy-Utility Trade-off. The 73% drop in Re-ID R1 confirms strong identity protection even under direct matching attacks. By retaining pose and scene context, our method achieves a favorable balance: stronger privacy with better distribution fidelity.

Table 4. Re-ID risk assessment on the CUHK03-NP detected dataset. ResNet50 trained on original data (767 IDs, 7,365 images), tested with original queries (1,400 images) against an anonymized gallery (5,332 images) with re-ranking. Lower scores indicate better privacy protection.

Method        mAP (%) ↓     R1 (%) ↓      R5 (%) ↓
Original      66.0          62.4          73.1
Gauss. Blur   6.4 (−90%)    9.4 (−85%)    15.1 (−79%)
DP2 [13]      4.4 (−93%)    8.6 (−86%)    12.7 (−83%)
FADM [48]     32.9 (−50%)   33.4 (−46%)   52.9 (−28%)
Ours          13.7 (−79%)   16.9 (−73%)   31.0 (−58%)

D.1.3. PII Detection Quality

We evaluate detection quality on the Visual Redactions Dataset [28, 29], which provides pixel-level annotations for privacy attributes (TEXTUAL, VISUAL, and MULTIMODAL) across 8,473 images (3,873 train, 1,611 validation, 2,989 test). We aggregate all privacy-sensitive regions into unified binary PII masks. The dataset-adapted auditor's classify_pii prompt enumerates all 24 privacy attribute types. Complete prompt specifications are in the supplementary.

Evaluation Metrics. For the supervised baseline, we report AP at IoU thresholds 0.5:0.95. For unified mask comparison: Dice (F1) = 2|P ∩ G| / (|P| + |G|), IoU = |P ∩ G| / |P ∪ G|, and pixel-level precision/recall, where P is the predicted mask and G the ground truth.
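The unified-mask metrics can be sketched directly from their definitions. For illustration we represent binary masks as sets of (row, col) pixel coordinates rather than arrays; the function name is ours.

```python
def mask_metrics(pred, gt):
    """Dice, IoU, precision, and recall for binary masks given as sets of
    (row, col) pixel coordinates, following the definitions
    Dice = 2|P∩G| / (|P|+|G|) and IoU = |P∩G| / |P∪G|."""
    p, g = set(pred), set(gt)
    inter = len(p & g)
    dice = 2 * inter / (len(p) + len(g)) if (p or g) else 1.0
    iou = inter / len(p | g) if (p or g) else 1.0
    prec = inter / len(p) if p else 0.0   # fraction of predicted pixels correct
    rec = inter / len(g) if g else 0.0    # fraction of ground-truth pixels found
    return dice, iou, prec, rec
```

For example, masks {(0,0), (0,1)} and {(0,1), (0,2)} share one pixel, giving Dice 0.5, IoU 1/3, and precision = recall = 0.5.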
Table 5 shows test set performance. The supervised Detectron2 baseline achieves 75.83% Dice and 68.71% IoU (80.71% precision, 78.02% recall) when evaluated on all 2,989 test images. Our zero-shot approach, requiring no training data, reaches 25.78% Dice and 20.83% IoU (52.98% precision, 25.85% recall). The performance gap reveals limitations of pure (L)VLM approaches for pixel-precise localization, particularly on high-frequency categories. Face and person body dominate the Visual Redactions Dataset, yet our method struggles with these categories. The precision-recall disparity (52.98% vs. 25.85%) indicates boundary delineation issues rather than semantic confusion: low recall reflects missed or imprecise boundaries that LVLMs struggle to delineate spatially. Qualitatively, the scout-and-zoom (region proposal) mechanism often zooms in excessively, cropping out necessary context and causing partial detections. While operating at a more granular level with 24 fine-grained categories, the method's weakness on abundant visual categories (faces, bodies) and over-aggressive region proposals heavily impact aggregate metrics. These findings motivate hybrid architectures that route high-frequency visual PII (faces, bodies) to specialized detectors while reserving LVLM reasoning for open-vocabulary categories.

Table 5. PII detection on the Visual Redactions test set (2,989 images).

Method                  | Dice ↑ | IoU ↑ | Prec./Rec. ↑
Mask R-CNN [11, 21, 42] | 75.83  | 68.71 | 80.71/78.02
Ours (zero-shot)        | 25.78  | 20.83 | 52.98/25.85

D.1.4. Ablation: Phase 1 vs. Full Pipeline

We compare Phase 1 alone against the full pipeline on 1,525 CityScapes test images. Phase 1 alone (67.8 s/img) detects zero indirect PII instances.
The full pipeline (133.5 s/img) recovers 1,107 indirect PII instances across 54 distinct object categories that predefined detectors cannot identify: vehicle markings (635 instances, 57.4%), textual elements such as rectangular signs (418, 37.8%), identity markers including motorcycles with visible plates (17, 1.5%), and other PII including windows with interior views (37, 3.3%). 76% of images converge by n = 2 iterations (bounded by n_max = 3).

D.1.5. Runtime Analysis

Table 6 reports per-image processing time averaged over 1,525 CityScapes images. Agent coordination overhead is modest (9.9 s, 7.4% of total), and the 133.5 s/img throughput suits batch curation scenarios.

Table 6. Runtime breakdown per image (CityScapes, n = 1,525).

Component                       | Time (s)
Phase 1: Detection + Inpainting | 67.8
Phase 2: LVLM + SAM + Inpaint   | 55.8
Agent overhead                  | 9.9
Total                           | 133.5

D.1.6. Downstream Utility

To verify that image quality preservation translates to downstream performance, we evaluate semantic segmentation with SegFormer on the CityScapes test set, using predictions on the original images as pseudo-ground truth. Our method achieves 0.877 mIoU (−0.123 vs. original), substantially outperforming SVIA (0.478, −0.522). Static classes remain near-perfect (< 0.01 drop), while dynamic categories like person show larger impacts (Table 7).

Table 7. Downstream segmentation on CityScapes test (SegFormer pseudo-GT).

Class    | Ours (Gap)     | SVIA (Gap)
mIoU     | 0.877 (−0.123) | 0.478 (−0.522)
road     | 0.996 (−0.005) | 0.955 (−0.045)
building | 0.977 (−0.023) | 0.677 (−0.323)
person   | 0.778 (−0.222) | 0.150 (−0.850)
car      | 0.975 (−0.025) | 0.567 (−0.433)
sky      | 0.995 (−0.005) | 0.820 (−0.180)

D.2. Qualitative Examples

Figure 4 presents CityScapes test images (2048 × 1024) with detected PII and anonymized outputs.

berlin_000002_000019. Phase 1 (95.2 s) detected 8 persons and 2 traffic signs, anonymizing all persons via batch inpainting with SDXL and OpenPose ControlNet.
Phase 2 (207.3 s) identified one police vehicle at bbox [126, 268, 400, 362]. Policy-aware spatial filtering excluded 8 person masks and 2 traffic sign masks before Grounded-SAM-2 segmentation produced a 34,222-pixel vehicle mask, inpainted with Canny ControlNet. The agentic workflow executed 3 PDCA iterations: iteration 1 detected residual PII at bbox [120, 260, 370, 370]; iteration 2 computed IoU = 0.88 with the processed instance, skipping reprocessing; iteration 3 reached n_max = 3, triggering termination. Final PII mask: 104,143 pixels (4.97% coverage) from 8 person masks (69,921 px) and 1 vehicle mask (34,222 px). Total time: 302.5 s. The agent dialogue provides explainability through tool calls, bounding boxes, and status transitions.

Figure 4. Pipeline output on CityScapes test images. Each row shows one street scene. Left: original; middle: detected PII with color-coded overlays (blue = persons, yellow = indirect PII vehicles, green = traffic signs, gray = license plates); right: anonymized output. Top (berlin_000002): 8 persons + 1 police vehicle + 2 traffic signs, 4.97% PII coverage, 3 PDCA iterations. Bottom (berlin_000472): multiple PII categories across the two-phase pipeline.

E. Discussion, Limitations, and Conclusion

Our framework prioritizes regulatory compliance and privacy protection over computational efficiency. Round-robin coordination ensures deadlock-free execution but scales linearly with scene complexity (133.5 s/image on dual L40S GPUs, Table 6), precluding real-time deployment. Despite structured output constraints, LLMs exhibit systematic failure modes: acknowledgment-without-execution, format inconsistency, and premature completion claims. Our mitigations reduce but do not eliminate these risks; we frame audit trails as supporting rather than guaranteeing compliance, flagging uncertain cases for human review.
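The IoU-based deduplication invoked in the walkthrough above (iteration 2 skipped an instance at IoU = 0.88, well above the 30% threshold) can be sketched as follows. `bbox_iou` and `deduplicate` are hypothetical helper names, assuming the [x_min, y_min, width, height] box format used by the agent prompts:

```python
def bbox_iou(a, b):
    """IoU of two boxes in [x_min, y_min, width, height] format."""
    ax0, ay0, aw, ah = a
    bx0, by0, bw, bh = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax0 + aw, bx0 + bw), min(ay0 + ah, by0 + bh)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def deduplicate(candidates, processed, threshold=0.30):
    """Drop candidates overlapping an already-processed instance by
    more than the 30% IoU threshold, preventing redundant inpainting
    of the same region in later PDCA iterations."""
    return [c for c in candidates
            if all(bbox_iou(c["bbox"], p["bbox"]) <= threshold
                   for p in processed)]
```

Under this rule, a residual detection that mostly re-covers an already-inpainted box is silently skipped rather than reprocessed.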
The current evaluation also does not compare the multi-agent PDCA architecture against a single-agent alternative (e.g., one LVLM pass without iterative auditing). The spatial filtering threshold (50% IoU) and overlap heuristics perform robustly on Western urban imagery but may require adaptation for culturally distinct contexts. PII detection on the Visual Redactions Dataset exposes limitations of zero-shot LVLM approaches for fine-grained localization, particularly for high-frequency categories (faces, bodies), where over-aggressive scout-and-zoom cropping removes context needed for accurate detection. This motivates hybrid architectures combining semantic reasoning with supervised spatial precision.

Machine vs. Human Re-ID. Our threat model targets automated Re-ID at scale, the primary mass surveillance vector. The 73% R1 reduction demonstrates effective protection against machine matching. Acquaintances may still recognize subjects through retained structural features (pose, build) that utility-preserving methods must retain; person Re-ID threat models primarily address automated systems, not determined observers with prior knowledge.

Ablation Coverage. The current evaluation lacks systematic ablations isolating individual components. We do not vary the diffusion backend (e.g., Flux vs. SDXL), the LVLM, or the segmentation model, nor explore sensitivity to key hyperparameters (n_max, IoU threshold, ControlNet conditioning scale). While the Phase 1 vs. full pipeline comparison demonstrates the value of agentic Phase 2 (1,107 additional indirect PII instances), finer-grained ablations would strengthen the design rationale. The diversity of recovered indirect PII provides indirect evidence for context-aware reasoning beyond fixed taxonomies, though formal correctness evaluation against human judgments remains future work.
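For concreteness, the bounded round-robin PDCA loop discussed throughout this paper can be reduced to a framework-free control-flow sketch. The callables below are hypothetical stand-ins for the three agents' tool calls (the actual system coordinates real agents via AutoGen [41]):

```python
from typing import Callable, Dict, List

def pdca_round_robin(classify: Callable[[], List[Dict]],
                     inpaint: Callable[[List[Dict]], None],
                     audit: Callable[[], List[Dict]],
                     n_max: int = 3) -> int:
    """Round-robin Auditor -> Orchestrator -> Generative cycle with a
    bounded number of audit iterations (n_max). Returns the number of
    audit iterations executed."""
    instances = classify()          # Plan: AuditorAgent detects indirect PII
    if not instances:
        return 0                    # shortcut: PIPELINE COMPLETE
    inpaint(instances)              # Do: GenerativeAgent anonymizes
    for n in range(1, n_max + 1):   # Check/Act: bounded audit loop
        residual = audit()
        if not residual:
            return n                # audit passed -> log and terminate
        inpaint(residual)           # reprocess residual PII
    return n_max                    # cap reached -> proceed to logging
```

The hard bound n_max = 3 is what guarantees termination even when the auditor keeps flagging residuals.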
In summary, we presented a multi-agent framework combining pre-defined preprocessing with bounded agentic refinement for context-aware anonymization while preserving data utility and enabling accountability through structured audit trails. Evaluations on CUHK03-NP and CityScapes demonstrate Re-ID risk reduction with superior distribution preservation. Future work includes parallel execution, model quantization, region-specific adaptation, hierarchical planning, and broader tool integration for agents.

Acknowledgement

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project "just better DATA - Effiziente und hochgenaue Datenerzeugung für KI-Anwendungen im Bereich autonomes Fahren" (efficient and highly accurate data generation for AI applications in autonomous driving). The authors would like to thank the consortium for the successful cooperation, and Stefanie Merz for her careful review of the manuscript.

References

[1] Robert Aufschläger, Youssef Shoeb, Azarm Nowzad, Michael Heigl, Fabian Bally, and Martin Schramm. Following the clues: Experiments on person re-id using cross-modal intelligence. arXiv preprint, 2025.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[3] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
[4] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell., 43(1):172–186, 2021.
[5] Junzhou Chen, Heqiang Huang, Ronghui Zhang, Nengchao Lyu, Yanyong Guo, Hong-Ning Dai, and Hong Yan. YOLO-TS: Real-time traffic sign detection with enhanced accuracy using optimized receptive fields and anchor-free fusion. IEEE Transactions on Intelligent Transportation Systems, pages 1–17, 2025.
[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[7] Matt Franchi, Hauke Sandhaus, Madiha Zahrah Choksi, Severin Engelmann, Wendy Ju, and Helen Nissenbaum. Privacy of groups in dense street imagery. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 2874–2891. Association for Computing Machinery, 2025.
[8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. Int. J. Rob. Res., 32(11):1231–1237, 2013.
[9] Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, and Ying-Cong Chen. ComfyMind: Toward general-purpose generation via tree-based planning and reactive feedback. In Advances in Neural Information Processing Systems, 2025.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2961–2969, 2017.
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[13] Håkon Hukkelås and Frank Lindseth. DeepPrivacy2: Towards realistic full-body anonymization. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1329–1338, 2023.
[14] Håkon Hukkelås, Rudolf Mester, and Frank Lindseth. DeepPrivacy: A generative adversarial network for face anonymization. In International Symposium on Visual Computing, pages 565–578. Springer, 2019.
[15] Håkon Hukkelås and Frank Lindseth. Does image anonymization impact computer vision training? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 140–150, 2023.
[16] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
[17] Marvin Klemp, Kevin Rösch, Royden Wagner, Jannik Quehl, and Martin Lauer. LDFA: Latent diffusion face anonymization for self-driving applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 3199–3205, 2023.
[18] Han-Wei Kung, Tuomas Varanka, and Nicu Sebe. Reverse personalization. arXiv preprint, 2025.
[19] Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, and Florian Tramèr. Large-scale online deanonymization with LLMs. arXiv preprint arXiv:2602.16800, 2026.
[20] He Li, Mang Ye, Ming Zhang, and Bo Du. All in one framework for multimodal re-identification in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17459–17469, 2024.
[21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[22] Dongyu Liu, Xuhong Wang, Cen Chen, Yanhao Wang, Shengyue Yao, and Yilun Lin. SVIA: A street view image anonymization framework for self-driving applications. In IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), pages 3567–3574, 2024.
[23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
[24] Xiangzeng Liu, Kunpeng Liu, Jianfeng Guo, Peipei Zhao, Yining Quan, and Qiguang Miao. Pose-guided attention learning for cloth-changing person re-identification. IEEE Transactions on Multimedia, 26:5490–5498, 2024.
[25] Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Yue Zhao, Zhen Xiang, and Chaowei Xiao. Doxing via the lens: Revealing privacy leakage in image geolocation for agentic multi-modal large reasoning models. arXiv preprint arXiv:2504.19373, 2025.
[26] Simon Malm, Viktor Rönnbäck, Amanda Håkansson, Minh-ha Le, Karol Wojtulewicz, and Niklas Carlsson. RAD: Realistic anonymization of images using stable diffusion. In Proceedings of the 23rd Workshop on Privacy in the Electronic Society, pages 193–211. Association for Computing Machinery, 2024.
[27] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-distilled StyleGAN: Towards generation from internet photos. In ACM SIGGRAPH 2022 Conference Proceedings. Association for Computing Machinery, 2022.
[28] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Towards a visual privacy advisor: Understanding and predicting privacy risks in images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3706–3715, 2017.
[29] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Connecting pixels to privacy and utility: Automatic redaction of private information in images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[30] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024.
[31] Nikhil Raina, Guruprasad Somasundaram, Kang Zheng, Sagar Miglani, Steve Saarinen, Jeff Meissner, Mark Schwesinger, Luis Pesqueira, Ishita Prasad, Edward Miller, et al. EgoBlur: Responsible innovation in Aria. arXiv preprint arXiv:2308.13093, 2023.
[32] Álvaro Ramajo-Ballester, José María Armingol Moreno, and Arturo de la Escalera Hueso. Dual license plate recognition and visual features encoding for vehicle identification. Robotics and Autonomous Systems, 172:104608, 2024.
[33] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. In International Conference on Learning Representations, 2025.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
[35] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[36] Praphul Singh, Charlotte Dzialo, Jangwon Kim, Sumana Srivatsa, Irfan Bulu, Sri Gadde, and Krishnaram Kenthapadi. RedactOR: An LLM-powered framework for automatic clinical data de-identification. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 510–530. Association for Computational Linguistics, 2025.
[37] Batuhan Tömekçe, Mark Vero, Robin Staab, and Martin Vechev. Private attribute inference from images with vision-language models. Advances in Neural Information Processing Systems, 37:103619–103651, 2024.
[38] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6877–6885, 2018.
[39] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[40] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In Computer Vision – ECCV 2016, pages 499–515. Springer International Publishing, 2016.
[41] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024.
[42] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[43] Xiangyuan Xue, Zeyu Lu, Di Huang, Zidong Wang, Wanli Ouyang, and Lei Bai. ComfyBench: Benchmarking LLM-based agents in ComfyUI for autonomously designing collaborative AI systems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24614–24624, 2025.
[44] Haoyu Zhai, Shuo Wang, Pirouz Naghavi, Qingying Hao, and Gang Wang. Restoring Gaussian blurred face images for deanonymization attacks. arXiv preprint, 2025.
[45] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[46] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[47] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3652–3661, 2017.
[48] Pascal Zwick, Kevin Roesch, Marvin Klemp, and Oliver Bringmann. Context-aware full body anonymization. In Computer Vision – ECCV 2024 Workshops, pages 36–52, Cham, 2025. Springer Nature Switzerland.

Supplementary Material

Abstract

This supplementary material provides comprehensive technical details for our multi-agent image anonymization framework, Context-Aware Image Anonymization with Multi-Agent Reasoning (CAIAMAR). We present complete agent system specifications, model configurations, justifications for design decisions, and examples.

S1. Multi-Agent System Architecture

S1.1. OrchestratorAgent: Workflow Coordination

The OrchestratorAgent coordinates Phase 2 anonymization, operating under a round-robin communication protocol using AutoGen [41]. Given that Phase 1 pre-defined processing has completed (person anonymization, license plate anonymization, traffic sign masking), the orchestrator manages the iterative PII classification and remediation workflow.

System Specification.
The OrchestratorAgent maintains workflow state S = {s_classify, s_inpaint, s_audit, s_log}, where each s_i ∈ {0, 1} indicates completion status. The communication protocol follows a strict cycle: Auditor → Orchestrator → Generative → Auditor.

System Prompt for OrchestratorAgent:

You are the OrchestratorAgent coordinating Phase 2 of an image anonymization pipeline.
Phase 1 (deterministic) has already completed: persons detected/inpainted, license plates detected/blurred, traffic signs detected/masked.

CRITICAL: YOU HAVE NO TOOLS. You coordinate OTHER agents who have tools. NEVER attempt to call functions yourself.

ROUND-ROBIN FLOW (automatic speaker rotation):
AuditorAgent -> OrchestratorAgent -> GenerativeAgent -> AuditorAgent (loop)
You receive control after each agent completes their task. Interpret results and guide the next agent.

REQUIRED WORKFLOW for Phase 2:
1. classify_pii -> 2. anonymize_and_inpaint (if PII found) -> 3. audit_output -> 4. log_output
SHORTCUT: If classify_pii finds NO PII and Phase 1 completed -> emit 'PIPELINE COMPLETE' immediately

YOUR ROLE:
- Monitor tool results IN THE CONVERSATION and track workflow state: [classify_pii ✓, inpaint_pii ✓, audit ✓, log ✓]
- When AuditorAgent returns results, analyze them and instruct the NEXT agent in round-robin
- When GenerativeAgent completes, provide brief status (next agent gets control automatically)
- If NO indirect PII found in classify_pii: emit 'PIPELINE COMPLETE' (Phase 1 already handled persons/plates)
- Announce clear phase transitions with progress indicators
- NEVER call tools yourself - you are a coordinator only

CRITICAL VALIDATION RULES:
1. ONLY report success (✓) if you can SEE the tool result in conversation history
2. If agent didn't call tool yet, instruct them to call it - don't claim it's done
3. Look for tool execution results (JSON responses) before marking steps complete
4. NEVER assume a tool succeeded just because an agent acknowledged - verify the result

RETRY LOGIC:
- If audit finds residual PII: say 'Found N residuals. GenerativeAgent, process the residual items from audit output.' (GenerativeAgent will extract the 'residual' array from the tool output)
- If no residuals OR max_attempts_reached=true: say 'Audit complete. AuditorAgent, log the output.'
- Audit attempts capped at 3. After 3rd failed audit, proceed to logging
- DO NOT manually format or stringify instance data - let GenerativeAgent read it from tool results

COMMUNICATION STYLE - BE CONCISE:
- Give DIRECT instructions to the NEXT agent in round-robin ONLY
- DO NOT repeat or echo what other agents said
- ROUND-ROBIN ORDER: Your next message goes to GenerativeAgent (unless workflow complete)
- Keep messages under 3 sentences
- NEVER call functions - just describe what should happen
- NEVER copy/paste instance data - reference tool outputs instead
- ONLY use checkmarks (✓) when you've VERIFIED tool execution in conversation

CONTEXT: Phase 1 already handled persons, license plates, traffic signs.
Focus on indirect PII requiring vision analysis (text, signs, logos in private spaces).

S1.2. AuditorAgent: Detection and Verification

The AuditorAgent implements a three-stage pipeline: (1) PII classification via a large vision-language model, (2) quality assurance through iterative auditing, and (3) result logging. This agent serves as both the entry point for Phase 2 processing and the quality gate before pipeline completion.

Functional Specification. The agent exposes three tools: T_classify, T_audit, and T_log. Classification employs spatial filtering to eliminate redundant detections. Auditing creates masked images to prevent false positives from LVLM re-detection of anonymized regions.

System Prompt for AuditorAgent:

You are the AuditorAgent responsible for PII detection, quality assurance, and logging.
Tools: classify_pii, audit_output, log_output

CRITICAL EXECUTION RULES:
1. When instructed to use a tool, YOU MUST CALL IT in your response
2. NEVER just acknowledge without calling the tool
3. Each response should contain EXACTLY ONE tool call
4. Tool calls are JSON function calls, not text descriptions

ROUND-ROBIN: You receive control after GenerativeAgent completes OR at workflow start. Execute your task, then control passes to OrchestratorAgent.

YOUR TASKS:
1. classify_pii: Detect indirect PII in private spaces (text on windows, house numbers, personal items visible indoors)
   - Call with: classify_pii(image='')
2. audit_output: Verify no residual PII remains after anonymization
   - Call with: audit_output(output='{canonical_path}')
3. log_output: Record final results
   - Call with: log_output(image='', output='{canonical_path}')

EXAMPLE WORKFLOW:
When OrchestratorAgent says: 'AuditorAgent: Please classify any remaining PII'
YOU MUST respond with the tool call (not text explanation):
classify_pii(image='artifacts/data/CityScapes/.../image.png')

REPORTING - BE CONCISE:
After tool execution, provide brief summary:
- After classify_pii: 'Found X instances.' (tool output visible to OrchestratorAgent)
- After audit_output: 'Audit passed.' OR 'Audit failed: X residuals.'
- After log_output: 'Logged.'

BE CONCISE:
- DO NOT repeat instructions from OrchestratorAgent
- DO NOT tell other agents what to do
- DO NOT wait for instructions - if OrchestratorAgent told you to do something, DO IT
- Keep responses under 1 sentence AFTER tool execution

IMPORTANT:
- Report EXACTLY what tool returns - don't hallucinate instances
- Only call one tool per turn, then stop
- Focus classify_pii on private property only (persons/plates/signs already handled)
- If you don't call the tool when instructed, the workflow will fail

S1.3. GenerativeAgent: Inpainting Execution

The GenerativeAgent implements the anonymization operation through diffusion-based inpainting.
Operating within the round-robin protocol, this agent receives PII instance specifications from the orchestrator and executes anonymization.

Operational Model. Given a set of PII instances I = {i_1, ..., i_n}, where each i_j contains spatial coordinates and semantic descriptors, the agent invokes T_inpaint(I), which performs unified batch processing via Stable Diffusion XL (SDXL) [30] with appropriate ControlNet conditioning (OpenPose [4] for persons, Canny for objects).

System Prompt for GenerativeAgent:

You are the GenerativeAgent responsible for anonymizing PII via inpainting.

Tool: anonymize_and_inpaint

CRITICAL EXECUTION RULES:
1. When instructed to anonymize, YOU MUST CALL anonymize_and_inpaint in your response
2. NEVER just acknowledge without calling the tool
3. Tool call is a JSON function call, not a text description
4. Extract instances from previous tool output (classify_pii or audit_output)

ROUND-ROBIN: You receive control after OrchestratorAgent. Execute the task, then control passes to AuditorAgent.

EXECUTION WORKFLOW:
1. Look at the most recent tool output in the conversation history
   - If classify_pii was called: find the JSON output and extract the 'instances' array
   - If audit_output was called: find the JSON output and extract the 'residual' array
2. Pass each dict object from that array as a separate element
3. Call anonymize_and_inpaint with the array of dict objects
4. Report results BRIEFLY: 'Processed X items.'

CRITICAL DATA FORMAT - COMMON MISTAKES:
CORRECT (array of dict objects as JSON):
anonymize_and_inpaint(instances=[
  {"det_prompt": "van with text", "description": "van", "bbox": [308, 200, 564, 567]},
  {"det_prompt": "blue sign", "description": "sign", "bbox": [215, 256, 294, 42]}
])
WRONG (single string containing all dicts - THIS IS THE MOST COMMON ERROR):
anonymize_and_inpaint(instances=[
  "{'det_prompt': 'van', 'bbox': [...]}, {'det_prompt': 'sign', 'bbox': [...]}"
])
-> DO NOT stringify the array! Pass actual JSON dict objects!
WRONG (stringified dict objects):
anonymize_and_inpaint(instances=["{'det_prompt': '...', 'bbox': [...]}"])
-> Each element must be a dict object, not a string

EXAMPLE EXTRACTION:
Given audit_output returns:
{'residual': [{'det_prompt': 'van', 'bbox': [308, 200, 564, 567]}, {'det_prompt': 'sign', 'bbox': [215, 256, 294, 42]}]}
YOU MUST call:
anonymize_and_inpaint(instances=[
  {"det_prompt": "van", "bbox": [308, 200, 564, 567]},
  {"det_prompt": "sign", "bbox": [215, 256, 294, 42]}
])
NOT this:
anonymize_and_inpaint(instances=["{'det_prompt': 'van', 'bbox': [308, 200, 564, 567]}, {'det_prompt': 'sign', 'bbox': [215, 256, 294, 42]}"])

BE CONCISE:
- DO NOT repeat OrchestratorAgent's instructions
- DO NOT tell other agents what to do
- DO NOT wait for instructions - if OrchestratorAgent told you to do something, DO IT
- Keep responses under 1 sentence AFTER tool execution

IMPORTANT:
- Report EXACTLY what tool returns - don't hallucinate instances
- Only call one tool per turn, then stop
- Focus classify_pii on private property only (persons/plates/signs already handled)
- If you don't call the tool when instructed, the workflow will fail

S2. Large Vision-Language Model Specification

S2.1. PII Classification Protocol

We employ Qwen2.5-VL-32B [2] for context-aware PII classification with the following complete system prompt (CityScapes [6]):

System Prompt for PII Detection (Qwen2.5-VL-32B):

You are a PII detection system. Identify text, numbers, visual elements, and objects revealing personal/private information.
Return ONLY valid JSON. No markdown, explanations, or additional text.
DETECT (Max 5 instances - most sensitive/prominent) Vehicles: ONLY with identifying features (entire vehicle body) - WITH text, logos, company names, decals, wraps, or custom graphics - Rare, distinctive, or modified (custom paint, unique damage, modifications) - Commercial/branded (delivery vans, company trucks, emergency vehicles) - EXCLUDE: Common generic vehicles without distinguishing marks Text/Numbers: URLs, social media handles, names, QR codes, barcodes Documents: Flyers, posters, notices with identifying information Signs/Plaques: Business names, contact info, billboards, real estate signs Visual Privacy: Windows showing private spaces with identifiable belongings GROUPING: Adjacent text on same surface = ONE instance. EXCLUDE: Public traffic signs, pre-anonymized areas. DESCRIPTIONS - ANONYMIZE ALL Use ONLY generic terms without ANY identifying details. Format: [generic_object_type] with [basic_visual_features] Rules for ALL descriptions: - Generic types only: "vehicle", "sign", "poster", "window", "board" - Basic colors (dark, light, white, black, blue, red) and shapes (rectangular, square, circular) - Can mention HAS text/graphics but NEVER the actual content - NO proper nouns, brands, makes/models, addresses, phone numbers, or specific details Examples: GOOD: "vehicle with markings", "rectangular sign with text", "window with interior view" BAD: "regular sedan" (not identifiable - skip), "Toyota Camry" (make/model), "sign saying ’Open’" (specific content), "window showing bedroom with family photos" (specific details) BBOX: [x_min, y_min, width, height] - MANDATORY with 50% margin - Tight bbox -> expand width/height by 1.5x -> center expansion -> clamp to bounds - Example: [500, 200, 400, 300] -> [400, 125, 600, 450] - Vehicles: entire vehicle body (all visible panels) + 50% margin JSON FORMAT (EXACT STRUCTURE - no markdown/code blocks) { "instances": [ { "description": "vehicle with markings", "bbox": [400, 125, 600, 450] } ] } Requirements: - EVERY 
instance MUST have BOTH "description" (string) AND "bbox" (4 integers: [x_min, y_min, width, height]) - If you cannot determine bbox, DO NOT include that instance - Empty result: {"instances": []} - Your response MUST start with { and end with } DETECTION PROCESS: 1. Scan image for PII elements (vehicles with identifying features, text, signs, windows) 2. Vehicles: Include ONLY if has text/logos/decals OR is rare/distinctive/modified (skip generic vehicles) 3. Text/signs: Include ONLY if reveals private information 4. Group adjacent text on same surface 5. Select top 5 most sensitive (priority: identifiable vehicles > personal text > signs > other PII) 6. For EACH: Locate bbox, describe with anonymous generic terms, expand bbox 50%, verify both fields present 7. Return valid JSON: {"instances": [{"description": "...", "bbox": [...]}]}

For PII Segmentation on the Visual Redactions Dataset [29] we use the following prompt:

You are a PII detection system. Identify text, numbers, visual elements, and objects revealing personal/private information. Return ONLY valid JSON. No markdown, explanations, or additional text.
DETECT (Max 5 instances - most sensitive/prominent) TEXTUAL PII (visible text/numbers): - Name (full names, first/last names) - Phone Number (phone numbers, contact info) - Home Address (residential addresses, street addresses) - Email Address (email addresses, social media handles) - Birth Date (dates of birth) - Location (geographic locations, landmarks with context) - Date/Time (timestamps, dates with identifying context) VISUAL PII (visible in image): - Face (human faces, full or partial) - License Plate (vehicle license plates) - Person (full person body when identifiable) - Nudity (exposed body parts) - Handwriting (handwritten text) - Physical Disability (visible disabilities/medical conditions) - Medical History (medical documents, prescriptions, health info) - Fingerprint (visible fingerprints) - Signature (handwritten signatures) MULTIMODAL PII (documents/objects): - Credit Card (credit cards, debit cards) - Passport (passports, travel documents) - Driver’s License (driver’s licenses, state IDs) - Student ID (student IDs, school cards) - Mail (addressed mail, envelopes with addresses) - Receipt (receipts with personal info) - Ticket (tickets with names/seats/dates) - Landmark (landmarks with identifying context revealing location) GROUPING: Adjacent text on same surface = ONE instance. PRIORITY: faces > documents > names/addresses > signatures > plates > medical > personal info EXCLUDE: Public signs without personal info, generic objects, pre-anonymized areas. DESCRIPTIONS - ANONYMIZE ALL Use ONLY generic terms without ANY identifying details. 
Format: [generic_pii_type] with [basic_visual_features] Rules for ALL descriptions: - Use PII category names: "face", "name", "phone number", "address", "license plate", "card", "document", "handwriting", "signature", "person" - Basic visual features: colors (dark, light), shapes (rectangular, circular), "with text", "with photo" - Can mention HAS text/graphics but NEVER the actual content - NO proper nouns, specific names, numbers, addresses, or identifying details Examples: ✓ GOOD: "face", "name with address", "license plate", "card with text", " handwriting on document", "signature", "person" x BAD: "John Smith", "California license ABC123", "Visa card 4532", "sign saying ’123 Main St’" BBOX: [x_min, y_min, width, height] - MANDATORY with 50% margin - Calculate tight bbox → expand width/height by 1.5x → center expansion → clamp to [0, image_bounds] - Example: tight [500, 200, 400, 300] → expanded [400, 125, 600, 450] - Faces: include full head with surrounding context - Documents/cards: include entire document + margins - Vehicles: entire visible vehicle body + 50% margin - Text regions: include all adjacent text on same surface JSON FORMAT (EXACT STRUCTURE - no markdown/code blocks) { "instances": [ { "description": "face", "bbox": [150, 80, 300, 380] }, { "description": "card with text", "bbox": [400, 500, 450, 340] } ] } Requirements: - EVERY instance MUST have BOTH "description" (string) AND "bbox" (4 integers: [x_min, y_min, width, height]) - If you cannot determine bbox, DO NOT include that instance - Empty result: {"instances": []} - Your response MUST start with { and end with } - Keep descriptions SHORT (2-5 words) DETECTION PROCESS: 1. Scan image for all PII types (faces, documents, text with names/addresses/ phone, signatures, plates, medical info, cards) 2. Group adjacent text on same surface (e.g., name + address on envelope = one instance) 3. 
Rank by sensitivity: faces > documents (passport/ID/cards) > names/addresses > signatures > plates > medical > other PII 4. Select top 5 most sensitive/prominent instances 5. For EACH: Locate bbox, describe with generic PII category (2-5 words), expand bbox 50%, verify both fields present 6. Return valid JSON: {"instances": [{"description": "...", "bbox": [...]}]}

Classification Objective. Given an image I and existing mask set M, identify context-dependent PII instances P = {p_1, ..., p_k} where each p_i = (desc_i, bbox_i) represents a physical description and spatial location.

Key Design Decisions.
• Maximum 3 instances: Prevents hallucination and focuses on the most critical PII
• Vehicle-level bounding boxes: The entire vehicle is anonymized when any PII is detected, preventing partial information leakage
• Grouping rule: Multi-line text is treated as a single instance, reducing fragmentation
• Priority ranking: Vehicles are prioritized due to their dynamic nature and high identifiability
• Bounding box format: [x_min, y_min, w, h] in image coordinates for direct use with detection models
• Physical descriptions only: Prevents the model from copying sensitive text and forces focus on visual appearance
• Exclusion filters: Persons, license plates, and traffic signs are handled by Phase 1 deterministic processing

S3. Algorithmic Implementations

S3.1. Tool Specifications: AuditorAgent

S3.1.1. classify_pii

PII Classification Protocol. The classify_pii tool employs Qwen2.5-VL-32B with structured prompting to identify indirect PII instances (text, signage, logos, context-dependent elements) not captured by Phase 1 detection. The LVLM returns instances with detection prompts, semantic descriptions, and bounding boxes in [x_min, y_min, w, h] format. For each classified instance, the tool performs scout-and-zoom verification: it crops to the LVLM bbox, runs Grounded-SAM-2 [23, 33] on the crop, and maps the resulting mask to full-image coordinates.
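The bbox expansion (50% margin, clamped to image bounds) and the crop-to-full-image mask mapping used in this scout-and-zoom step can be sketched as follows. This is a minimal illustration assuming boolean NumPy masks; the segmentation call itself (Grounded-SAM-2) is omitted, and the helper names are ours, not the tool's:

```python
import numpy as np

def expand_bbox(bbox, margin, img_w, img_h):
    """Expand [x, y, w, h] by `margin` (e.g. 0.5 for 50%) around its center,
    clamped to the image bounds, returning (x, y, w, h)."""
    x, y, w, h = bbox
    dw, dh = w * margin / 2, h * margin / 2
    x0, y0 = max(0, int(x - dw)), max(0, int(y - dh))
    x1, y1 = min(img_w, int(x + w + dw)), min(img_h, int(y + h + dh))
    return x0, y0, x1 - x0, y1 - y0

def mask_to_full(crop_mask, crop_bbox, img_w, img_h):
    """Place a boolean mask computed on a crop back into full-image coordinates."""
    x, y, w, h = crop_bbox
    full = np.zeros((img_h, img_w), dtype=bool)
    full[y:y + h, x:x + w] = crop_mask[:h, :w]
    return full

# Example from the prompt: tight [500, 200, 400, 300] -> expanded [400, 125, 600, 450]
print(expand_bbox([500, 200, 400, 300], 0.5, 2048, 1024))
```

The expansion reproduces the worked example given in the detection prompt; clamping handles boxes near image borders, where the effective margin shrinks.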
Post-detection filtering (after the first anonymization) excludes instances with ≥ 50% spatial overlap against already-processed masks (persons, license plates, traffic signs, previously inpainted regions) to prevent redundant work. This adaptive classification operates without pre-defined category constraints.

S3.1.2. audit_output: Quality Verification

Masked Validation Protocol. The audit_output tool performs quality verification with pre-detection suppression. To prevent false positives from LVLM re-detection of already anonymized regions, the tool constructs a synthetic masked image I_masked in which all processed areas are replaced with black pixels. The LVLM classifier operates on I_masked to identify residual PII instances P_residual = LVLM(I_masked, θ_classify). The function returns a success status (true when |P_residual| = 0), the list of detected residual instances, the current iteration count, and a termination flag indicating whether the maximum number of attempts (t ≥ 3) was reached. This pre-detection suppression eliminates false positives on anonymized areas, while the iteration limit prevents infinite retry loops.

S3.1.3. log_output: Result Persistence

The log_output tool persists anonymization results and marks pipeline completion. Given the original image path and the final anonymized output path, the tool saves result metadata and returns success status indicators.

S3.2. Tool Specifications: GenerativeAgent

S3.2.1. anonymize_and_inpaint

Algorithmic Design. The anonymization function T_inpaint: I → I_out processes instance set I via diffusion-based inpainting with Canny edge guidance. For each instance i_j ∈ I:
1. IoU-based deduplication: Check overlap with processed instances (threshold τ = 0.3); skip if redundant
2. Scout-and-zoom segmentation: Apply Grounded-SAM-2 on the cropped region (20% margin) to obtain a precise mask m_j
3. Coordinate mapping: Transform m_j to full-image coordinates as M_j
4.
Diffusion inpainting: Apply SDXL with Canny ControlNet conditioning at 768 × 768 resolution

Configuration Parameters. The diffusion pipeline operates with:
• Denoising strength: α = 0.9 (strong modification)
• Sampling steps: T = 25 (SDXL)
• ControlNet scale: λ_canny = 0.3 (moderate edge preservation)
• Guidance scale: s = 9.0
• Color matching: disabled (β = 0.0)
• IoU threshold: τ = 0.3 (30% overlap for deduplication)

Implementation. The anonymize_and_inpaint tool processes instances sequentially (not batched) through SDXL + Canny ControlNet anonymization. For each instance i_j, the tool first checks IoU overlap against all processed instances using the configurable threshold PII_IOU_THRESHOLD (default 0.3). Instances with high overlap are skipped with status iou_overlap_with_processed, preventing redundant reprocessing when the AuditorAgent detects the same region across iterations. For novel instances, scout-and-zoom segmentation crops region R_j with a 20% margin, applies Grounded-SAM-2, and maps the resulting mask m_j to full-image coordinates as M_j. SDXL then inpaints the masked region at 768 × 768 resolution. Failed segmentations are skipped with status scout_zoom_failed_no_fallback. The function returns counts of successfully processed, skipped (IoU overlap or segmentation failures), and failed (inpainting errors) instances, along with per-instance status details and an overall completion indicator.

S3.3. Phase 1 Deterministic Tools

S3.3.1. segment_persons: YOLOv8m-seg Instance Segmentation

Architecture. We employ YOLOv8m-seg⁶, a medium-scale one-stage detector with instance segmentation capability. The model provides real-time performance while maintaining high accuracy for person detection.

Post-processing.
Masks use tight YOLO segmentation boundaries:
• Morphological dilation: Disabled
• Effect: Precise mask boundaries from YOLO segmentation without expansion
• Rationale: SDXL inpainting handles edge blending without requiring dilated masks

Implementation. The segment_persons tool applies YOLOv8m-seg for person instance segmentation. The model uses a confidence threshold τ_c = 0.25, balancing recall and precision for privacy protection. Detected person masks are used directly without morphological dilation, relying on YOLO's accurate segmentation boundaries. The function returns detection count, binary masks, bounding boxes, and confidence scores. YOLOv8m provides a practical balance between processing speed and accuracy for real-time person detection.

S3.3.2. anonymize_and_inpaint_persons: Three-Stage Pipeline

Pipeline Architecture. Person anonymization employs a cascaded LVLM-LVLM-diffusion architecture to eliminate appearance correlation while preserving contextual plausibility:
1. Visual description (Qwen2.5-VL-32B [2]): Extract semantic attributes A_i = {pose, clothing, action} for each person i
2. Attribute diversification (Qwen2.5-32B): Transform A_i → A'_i by sampling colors from palette C with uniform distribution
3. Conditioned inpainting (SDXL [30] + OpenPose): Generate anonymized person I'_i using pose-preserved synthesis with ControlNet [45] conditioning

⁶ https://github.com/ultralytics/ultralytics, Accessed October 25, 2025

Color Palette. C = {gray, beige, navy, white, black, brown, khaki, ...} contains 20 diverse colors with brightness modifiers {light, dark, bright, faded, ...} for minimal collision probability (prompt: see below).

Diffusion Configuration.
Resolution: 768 × 768
α (strength): 0.9
T (steps): 25 (SDXL persons)
λ_OP (OpenPose): 0.8
s (guidance): 9.0
β (color matching): 0.0 (disabled)   (S1)

Implementation.
The anonymize_and_inpaint_persons tool executes three-stage person anonymization with attribute diversification. First, Qwen2.5-VL-32B extracts semantic attributes A_i (pose, clothing, action) for each person i. Second, Qwen2.5-32B-Instruct transforms A_i → A'_i by sampling colors from the diverse palette C, decoupling description from diversification. Third, SDXL with OpenPose ControlNet generates the anonymized person I'_i using pose-preserved synthesis with the configuration above. Color matching is disabled (β = 0.0) to allow complete appearance transformation, while OpenPose conditioning maintains pose realism and context consistency. The function returns the anonymized count, output path, and processing time.

S3.3.3. detect_traffic_signs: YOLO-TS Specialized Detection

Model Specification. We employ YOLO-TS [5] (YOLO for Traffic Signs), a YOLOv8-based detector fine-tuned on the German Traffic Sign Detection Benchmark (GTSDB). The model specializes in detecting traffic signs across diverse scales and lighting conditions.

Configuration.
• Model: GTSDB_best.pt (trained on German traffic signs)
• Resolution: 1024 × 1024 pixels
• Confidence threshold: τ_c = 0.2
• Output: Bounding boxes converted to binary exclusion masks

Implementation. The detect_traffic_signs tool employs YOLO-TS with a YOLOv8 backbone and a GTSDB-specialized detection head, trained on the German Traffic Sign Detection Benchmark. Operating at 1024 × 1024 resolution with confidence threshold τ_c = 0.2, the model prioritizes recall to capture all public signage and avoid false negatives. Detected bounding boxes are converted to binary exclusion masks for unified masking with other PII categories. The function returns detection count, binary masks, bounding boxes, and traffic sign type IDs.

S3.3.4. detect_license_plates: Custom YOLOv8s Training

Model Development.
We trained a custom YOLOv8s detector on the UC3M License Plates (UC3M-LP) dataset [32] to achieve high-recall license plate detection. The model operates at 1280 × 1280 resolution at inference for improved small-object detection.

Training Configuration.
• Architecture: YOLOv8s
• Dataset: UC3M-LP (1,580 train, 395 val images)
• Resolution: 1280 × 1280 pixels
• Hardware: Single L40S GPU, batch size 8, 100 epochs
• Performance: Precision = 0.93, Recall = 0.94, mAP50 = 0.92, mAP50-95 = 0.82

Inference Configuration.
• Resolution: 1280 px (training resolution)
• Confidence: τ_c = 0.05 (ultra-low threshold maximizes recall for privacy)
• Rationale: Prioritize recall over precision to ensure no license plates remain visible

S3.3.5. blur_license_plates: Gaussian Blur Anonymization

Blur Parameters. License plates are anonymized via Gaussian blur to ensure text illegibility while maintaining image naturalness:

I'(x, y) = Σ_{(i,j) ∈ N} I(x + i, y + j) · G_σ(i, j)   (S2)

where G_σ is a Gaussian kernel with σ = 8 pixels and kernel size 15 × 15.

Implementation. The blur_license_plates tool applies Gaussian blur anonymization to detected license plates. Using a Gaussian kernel with standard deviation σ = 8 pixels and kernel size 15 × 15, the filter ensures complete text illegibility while maintaining visual naturalness compared to pixelation approaches. The smooth falloff of the Gaussian function produces less jarring transitions at mask boundaries. The function returns the count of blurred plates and the output path.

S4. Qualitative Anonymization Examples

Figure S1 compares anonymization methods on CUHK03-NP test examples. Each row shows the same person processed by different techniques.

Figure S1. Qualitative comparison of person anonymization methods on CUHK03-NP test examples. Each row shows the same person processed by different methods (left to right): Original, Gaussian Blur, DeepPrivacy2 (DP2), FADM, and our approach.
Our method preserves pose and scene structure while effectively anonymizing identities with photorealistic results. Blur destroys facial details and overall image quality. DP2 produces synthetic faces with visible artifacts. FADM maintains high similarity to the originals (a privacy risk). Our method achieves the optimal balance between privacy protection and visual quality.

S5. Diffusion Model Conditioning

S5.1. Person Description Protocol

LVLM Prompt Engineering. We design the description prompt to extract semantic attributes while enforcing attribute diversification at the prompt level. The Qwen2.5-VL-32B model is prompted to describe a person's build, pose, viewpoint, and action, while diversifying clothing attributes via randomized color and brightness selection. The operational prompt used for this purpose is defined as follows:

DETECTOR_PERSON_DESCRIPTION_PROMPT = "You are describing a person for AI-based anonymization. Generate a SHORT description for Stable Diffusion inpainting. CRITICAL: Invent DIVERSE clothing colors AND brightness - DO NOT copy from the image! Colors (pick randomly): gray, beige, navy, white, black, brown, khaki, blue, red, green, yellow, pink, teal, burgundy, olive, charcoal, maroon, tan, cream, sage. Brightness modifiers (pick randomly): light, dark, bright, faded, vibrant, muted, pale, deep, pastel, bold. Example colors: 'light beige', 'dark navy', 'bright red', 'faded olive', 'deep burgundy', 'pale pink', 'vibrant teal'. Include: body build (slim/medium/heavy), pose (standing/walking/sitting), view (front/side/back), action (talking/walking/standing). SPECIAL CASE: If uniformed (police/fire/military/security), preserve uniform type and official colors. Format: 'A [build] person [action], [view] view, wearing [brightness] [color] [top] and [brightness] [color] [bottom]' Limit: 20-30 words maximum. Examples: 1.
Generic: {"description": "A medium-build person walking, side view, wearing dark olive jacket and pale charcoal pants"} 2. Generic: {"description": "A slim person standing, front view, wearing vibrant burgundy sweater and light khaki chinos"} 3. Generic: {"description": "A heavy-build person sitting, side view, wearing faded maroon hoodie and deep black jeans"} 4. Officer: {"description": "A police officer talking, facing forward, wearing navy police uniform with badge"} Return ONLY valid JSON: {"description": "your text here"}"

S5.2. Person Inpainting Configuration

SDXL Pipeline Parameters. Person synthesis employs SDXL with OpenPose ControlNet. The positive and negative prompts (excluding the LVLM-generated description) are adapted from [48], and the pipeline uses the following hyperparameters:

# Prompt Construction
POSITIVE = "{LVLM_description}, RAW photo, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3, photorealistic, detailed"
NEGATIVE = "deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, text, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"

# Hyperparameters
RESOLUTION = 768        # Input/output resolution
STRENGTH = 0.9          # Denoising strength α
STEPS = 25              # Sampling steps T (SDXL persons)
GUIDANCE_SCALE = 9.0    # Classifier-free guidance s
OPENPOSE_SCALE = 0.8    # ControlNet conditioning strength λ_OP

# Color Matching (disabled for full appearance transformation)
LUMINANCE_MATCH = 0.0
COLOR_STRONG = 0.0
COLOR_SUBTLE = 0.0

Justification.
• α = 0.9: High denoising enables complete appearance transformation while preserving the OpenPose structure.
• T = 25: The SDXL architecture allows fewer steps than SD 1.5 while maintaining quality at 768 px.
• s = 9.0: Slightly above the standard CFG scale (7.5) for stable diffusion models.
• λ_OpenPose = 0.8: Balances pose preservation vs. natural variation.

S5.3. Object Inpainting Configuration

SDXL Pipeline for Context-Dependent PII. Object anonymization employs Canny edge conditioning for structure preservation with Stable Diffusion XL:

# Prompt Construction
POSITIVE = "{LVLM_description}"   # Use LVLM-provided semantic description
FALLBACK = "background scene"    # If LVLM description unavailable
NEGATIVE = ""                    # Empty by default (context-appropriate)

# Hyperparameters
RESOLUTION = 768       # Input/output resolution
STRENGTH = 0.9         # Denoising strength α
STEPS = 25             # Sampling steps T (SDXL objects)
GUIDANCE_SCALE = 9.0   # CFG scale s
CANNY_SCALE = 0.3      # ControlNet conditioning λ_CN
CANNY_LOW = 10         # Hysteresis low threshold
CANNY_HIGH = 30        # Hysteresis high threshold

# Color Matching (disabled for complete transformation)
LAB_STRONG = 0.0
LAB_SUBTLE = 0.0

Parameter Selection.
• T = 25: The SDXL architecture enables efficient generation with fewer steps than SD 1.5
• λ_Canny = 0.3: Moderate edge guidance allows flexible background reconstruction during object removal
• Canny thresholds (10, 30): Capture salient edges while filtering noise
• Empty negative prompt: Maximizes generation flexibility for diverse object types

S6. License Plate Detection: Extended Analysis

S6.1. Training Methodology

Dataset Characteristics. UC3M-LP provides 1,975 annotated images with diverse plate formats, viewing angles, and lighting conditions representative of real-world street imagery.

Model Architecture. YOLOv8s (11.2M parameters) balances detection accuracy with inference speed, trained at 1280 px resolution for 100 epochs on a single NVIDIA L40S GPU.

Training Configuration.
Key hyperparameters:
• Optimizer: SGD with lr0 = 0.01, momentum = 0.937, weight decay = 0.0005
• Learning rate schedule: Cosine annealing with warmup (3 epochs), final lr = 0.0001
• Loss weights: box = 7.5, cls = 0.5, dfl = 1.5 (emphasizes localization accuracy)
• Batch size: 8 with mixed precision (AMP) training

Augmentation Strategy. We employ aggressive data augmentation to improve generalization:
• Mosaic (prob = 1.0): Combines 4 images into a single training sample, enhancing multi-scale learning
• Copy-paste (prob = 0.3): Synthesizes challenging occlusion scenarios
• MixUp (prob = 0.1): Linear interpolation between samples for regularization
• Multi-scale training: [0.5, 1.5] × base resolution for scale invariance
• Color jitter: HSV adjustments (h = 0.015, s = 0.7, v = 0.4) for lighting invariance
• Spatial augmentation: Horizontal flip (prob = 0.5), rotation, translation, and scaling

Training Dynamics. Figure S2 shows convergence behavior. The model achieves stable performance by epoch 60, with validation mAP50-95 plateauing at 0.816. Training and validation losses demonstrate consistent convergence without overfitting, validating our augmentation strategy.

Figure S2. Training curves for the YOLOv8s license plate detector. Loss curves and metrics over 100 epochs; stable convergence is achieved by epoch 60.

Performance Analysis. Figure S3 presents the precision-recall analysis. The model achieves:

Precision = 0.93, Recall = 0.94, mAP50 = 0.91, mAP50-95 = 0.82   (S3)

The high recall (0.94) is critical for privacy applications where false negatives are unacceptable. Precision-recall curve analysis shows robust performance across confidence thresholds.

Figure S3. License plate detection performance. Left: Precision-recall curve showing mAP50 = 0.91. Right: Normalized confusion matrix with 0.94 true positive rate.

S6.2. Training Configuration

We train Mask R-CNN [11] with ResNet-50-FPN [21] on the Visual Redactions Dataset [28, 29]: 3,873 training images, 21,489 annotations across 28 privacy categories. COCO-pretrained weights initialize the backbone. Training uses SGD (momentum 0.9, weight decay 0.0001), base LR 0.002 with warmup (500 iterations), decay at 18K/25K iterations, batch size 8 (2× L40S GPUs), 30K iterations (~5.7 hours). Augmentation: random horizontal flip, multi-scale training (640-800 px shortest edge).

S6.3. Results

Training Dynamics. Total loss decreases from 5.93 to 0.32 over 30K iterations (94.6% reduction). Component-wise: RPN classification (99.2% reduction), classification loss (98.5%), mask loss (82.6%), box regression (72.7%). Validation AP improves monotonically from 11.77% (2K) to 17.70% (30K) without overfitting. Figure S4 shows the training dynamics.

Category-Level Performance. Overall mask AP: 17.70% (AP50: 25.33%, AP75: 19.41%). Severe scale dependency: AP_S: 3.37%, AP_L: 19.60%. Performance clusters by attribute type (Table S1): Visual classes (face: 56.98%, person: 48.68%, passport: 70.55%) exploit COCO pretraining successfully. Textual categories exhibit systematic failure (10/10 categories < 5% AP: name 1.16%, address 0.64%, phone 0.00%), despite abundant data (address: 1,902 instances; date_time: 2,979 instances). Small object scale (AP_S: 3.37%) causes detection failure on spatially compact text.

Test Set. Unified binary masks (logical OR across 28 categories) achieve Dice: 75.83%, IoU: 68.71%, Precision: 80.71%, Recall: 78.02% (2,890/2,989 images; 99 missing predictions). Performance is dominated by visually salient categories; the ~22% of missed pixels stem primarily from small text below detection thresholds.

S7. Qualitative Comparison: Supervised vs. Zero-Shot

Figure S5 shows randomly sampled test images comparing ground truth, Detectron2, and zero-shot predictions.
Detectron2 achieves higher precision on faces, persons, and documents (categories with abundant training data and COCO initialization) but fails completely on 99/2,989 test images (3.3%). Zero-shot provides broader category generalization but produces false positives on ambiguous regions. Figure S6 shows Detectron2's best cases: accurate mask boundaries on faces (AP: 56.98%), persons (48.68%), and passports (70.55%). Figure S7 shows zero-shot advantages.

S8. KITTI Dataset Example

We processed KITTI [8] image 0000000165 to evaluate cross-dataset generalization. KITTI provides street-level autonomous driving imagery at lower resolution than CityScapes (2048 × 1024).

Detection and Processing. The pipeline detected 14 persons and 3 rectangular signs (Figure S8). Phase 1 batch-inpainted all persons using SDXL with OpenPose ControlNet. The LVLM generated appearance descriptions for close person bounding boxes: "A medium-build person walking, front view, wearing muted navy blazer and light tan pants", "A medium-build person standing, side view, wearing bright yellow t-shirt and dark teal pants". Phase 2 processed rectangular signs using Canny ControlNet.

PDCA Iteration and Audit. The workflow executed 3 PDCA iterations before reaching n_max = 3. The final audit detected 1 residual sign at bbox [948, 0, 1064, 100], likely a distant or partially occluded sign at the image boundary. This illustrates the privacy-utility trade-off: n_max bounds execution time (391.8s) while occasionally leaving residual PII.

Figure S4. Training dynamics of Mask R-CNN on the Visual Redactions Dataset. (a) Total loss shows a 94.6% reduction over 30K iterations. (b) Validation AP: segmentation AP from 11.77% to 17.70%, bbox AP from 9.24% to 18.04%. (c) Component losses: RPN classification (99.2% reduction), classification (98.5%), mask (82.6%), box regression (72.7%). (d) Learning rate schedule: linear warmup (500 iterations), decay (×0.1) at 18K and 25K iterations.

Agent Dialogue Excerpt. Table S2 shows Phase 2 agent interactions. After Phase 1 completion, the AuditorAgent invokes classify_pii, identifying 1 rectangular sign at bbox [464, 275, 526, 341] (instance ID: pii-0001). The OrchestratorAgent routes this to the GenerativeAgent, which invokes anonymize_and_inpaint. Tool execution includes scout-and-zoom segmentation (crop generation, Grounded-SAM-2 detection, mask mapping), protected-area subtraction (15 masks from persons and traffic signs), and SDXL inpainting with Canny ControlNet. Final PII coverage: 5.29% (37,677 pixels).

Statistics. The processing generated 14 person masks and 3 sign masks. Combined PII coverage: 5.29% (37,677 pixels), primarily from pedestrians (4.88% persons, 0.41% signs). Processing time: 391.8s total — Phase 1: 160.5s for Grounded-SAM-2 detection; Phase 2: 231.3s for 3 PDCA iterations with Qwen2.5-VL-32B auditing and Qwen2.5-32B agent reasoning. Each detection included interpretable tool calls and bbox annotations.

Cross-Dataset Generalization. The KITTI example tests cross-dataset generalization: lower resolution than CityScapes (2048 × 1024), a different camera perspective (dashboard vs. urban mapping vehicle), and varying scene complexity. Zero-shot detection and pose-guided inpainting adapted to KITTI without dataset-specific tuning.

Table S1. Per-category mask AP on the Visual Redactions validation set (n = 1,611 images). Categories grouped by performance regime. Training instance counts reflect severe class imbalance. Abundant data does not guarantee success: date_time (2,979 instances → 3.75% AP), address (1,902 instances → 0.64% AP). The category distribution shows that visual attributes (face, person) dominate both instance count (46.4%) and performance, while textual categories suffer systematic failure despite comprising 36% of training instances.

Category | Instances | AP (%)
High Performance (AP > 20%)
Passport | 153 | 70.55
Face | 4,039 | 56.98
Person | 5,937 | 48.68
Fingerprint | 79 | 45.02
License Plate | 371 | 33.02
Student ID | 37 | 32.35
Ticket | 420 | 32.14
Receipt | 113 | 30.35
Medium Performance (5% < AP ≤ 20%)
Credit Card | 102 | 18.06
Disability | 73 | 17.76
Drivers License | 30 | 16.16
Nudity | 320 | 15.20
Ethnic Clothing | 123 | 12.74
Mail | 73 | 12.21
Medicine | 116 | 10.50
Handwriting | 1,077 | 10.29
Birth Date | 60 | 7.48
Signature | 252 | 7.40
Low/Zero Performance (AP ≤ 5%)
Education | 48 | 5.18
Date/Time | 2,979 | 3.75
Username | 310 | 3.51
Address Home | 90 | 2.11
Landmark | 807 | 1.64
Name | 1,785 | 1.16
Email | 68 | 0.69
Address | 1,902 | 0.64
Phone | 105 | 0.00
ID Card | 20 | 0.00
Total | 21,489 | 17.70

Table S2. Agent dialogue excerpt from the KITTI Phase 2 workflow (indirect PII detection and anonymization).

Agent | Action/Message
UserProxyAgent | Phase 1 complete. AuditorAgent: classify PII in 0000000165.png using classify_pii.
AuditorAgent | Tool call: classify_pii, image: artifacts/.../0000000165.png
Tool Result | instances: [{ det_prompt: "rectangular sign with text.", bbox: [464, 275, 526, 341], instance_id: "pii-0001" }], meta: { provider: ollama, model: qwen2.5vl:32b, tiles: 1 }
OrchestratorAgent | Phase 2: [classify_pii ✓] Found indirect PII. GenerativeAgent, anonymize instances.
GenerativeAgent | Tool call: anonymize_and_inpaint, instances: [{ bbox: [464, 275, 526, 341], description: "rectangular sign with text", instance_id: "pii-0001" }]
Tool Execution | Scout-and-zoom: crop [464, 275, 526, 237] → detect → map mask. Protected-area subtraction: 15 masks (persons, traffic signs). Inpainting: SDXL + Canny ControlNet. PII mask update: 37,677 pixels (5.29% coverage). ...

Figure S5. Random examples from the Visual Redactions test set. Ground truth (with category labels), Detectron2, and zero-shot predictions.
Supervised achieves higher precision on common categories but fails on 99 images (3.3%). Zero-shot provides full coverage with broader generalization but occasional false positives.

Figure S6. Detectron2 best cases. Top 4 examples by Dice score. Supervised excels on faces, persons, and documents with abundant training data and COCO initialization.

Figure S7. Zero-shot advantage cases. Top 4 examples where zero-shot outperforms Detectron2. Agents with LVLM utilization succeed on rare categories, novel formats, and low-confidence regions.

Figure S8. KITTI example (0000000165). Left: Original street scene with 14 pedestrians and multiple traffic signs. Right: Anonymized output after the PDCA workflow. Phase 1 batch-inpainted 14 persons using SDXL with OpenPose ControlNet. Phase 2 processed 3 rectangular signs with Canny ControlNet. Final PII coverage: 5.29% (37,677 pixels). Total processing time: 391.8s with 3 PDCA iterations.
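For reference, the IoU-based deduplication that recurs throughout the pipeline (Section S3.2.1, threshold τ = 0.3, exposed as PII_IOU_THRESHOLD) reduces to a simple mask-overlap check. The following is a minimal sketch assuming boolean NumPy masks; the helper names are illustrative, not the tool's actual API:

```python
import numpy as np

PII_IOU_THRESHOLD = 0.3  # 30% overlap => treated as already processed

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum()) / float(union)

def is_duplicate(candidate: np.ndarray, processed: list, thr: float = PII_IOU_THRESHOLD) -> bool:
    """True if the candidate mask overlaps any already-processed mask at or above `thr`."""
    return any(mask_iou(candidate, m) >= thr for m in processed)
```

A candidate flagged by is_duplicate would be skipped (status iou_overlap_with_processed in the tool's terminology), which is what prevents re-inpainting when the AuditorAgent re-detects the same region in a later PDCA iteration.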