CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization
To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. However, current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level labels. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which achieves multimodal weakly supervised manipulation localization for image-text pairs using only coarse-grained image/sentence-level annotations. CIEC comprises two branches: image-based and text-based weakly supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module, which integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions with the aid of spatial priors; background silencing and spatial contrast constraints then suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module, which focuses on meaningful content words and leverages relative visual bias to assist token localization; asymmetric sparse and semantic consistency constraints then mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of CIEC, which yields results comparable to fully supervised methods on several evaluation metrics.
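To make the weakly supervised setup concrete, the sketch below shows the generic idea behind image-side localization from coarse labels: score each patch against a text cue, then pool the top patches into a single image-level prediction that a coarse label can supervise. All function names, shapes, and the top-k pooling choice are illustrative assumptions, not the paper's actual TRPS implementation.

```python
import numpy as np

def patch_scores(patch_feats, text_feat):
    """Score each image patch by cosine similarity to a pooled text cue.

    patch_feats: (num_patches, dim) patch embeddings
    text_feat:   (dim,) pooled text embedding
    Names and shapes are illustrative placeholders.
    """
    p = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    return p @ t

def image_level_score(scores, k=3):
    """Aggregate per-patch scores into one image-level forgery score by
    averaging the top-k patches (a common MIL-style pooling for weak labels)."""
    topk = np.sort(scores)[-k:]
    return float(topk.mean())
```

Because only the image-level score is supervised, the per-patch scores act as the localization map at test time; this is the standard multiple-instance-learning recipe that weakly supervised frameworks of this kind build on.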
💡 Research Summary
The paper addresses the problem of locating manipulated regions in image‑text pairs without relying on expensive fine‑grained annotations. Existing multimodal forgery localization methods typically require pixel‑level masks or token‑level labels, which are costly to obtain and limit scalability, especially when the forged regions in the two modalities are independent. To overcome these challenges, the authors propose CIEC (Coupling Implicit and Explicit Cues), a weakly‑supervised framework that needs only coarse image‑level or sentence‑level ground‑truth labels. CIEC consists of two parallel branches.

The image‑centric branch introduces the Textual‑guidance Refine Patch Selection (TRPS) module. TRPS first extracts salient nouns and adjectives from the accompanying text, generates a coarse visual attention map aligned with these words, and then refines candidate patches using spatial priors (e.g., typical object locations). Two auxiliary losses—Background Silencing and Spatial Contrast—are applied to suppress irrelevant background activations and to sharpen the distinction between suspicious and non‑suspicious regions.

The text‑centric branch contains the Visual‑deviation Calibrated Token Grounding (VCTG) module. VCTG computes a visual deviation score for each token via cross‑modal attention, selecting tokens with high deviation as “visual suspicion tokens.” An asymmetric sparse constraint penalizes tokens that are unlikely to be forged, reducing label noise, while a semantic consistency constraint preserves intra‑textual logical coherence.

The two branches are jointly optimized, allowing visual cues to guide text grounding and vice versa. Experiments on the DGM4 dataset and several other benchmarks demonstrate that CIEC achieves performance (IoU, F1, accuracy) comparable to fully supervised state‑of‑the‑art methods, while cutting annotation cost by more than 90%.
Notably, the model remains robust under cross‑modal inconsistency cases (e.g., image true/text fake or vice versa), where prior weakly‑supervised approaches struggle. The authors also discuss extensions such as integrating large language models for richer textual guidance and coupling with vision transformers, positioning CIEC as a versatile foundation for future multimodal forgery detection research.