DF-LLaVA: Unlocking MLLMs for Synthetic Image Detection via Knowledge Injection and Conflict-Driven Self-Reflection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

With the increasing prevalence of synthetic images, accurately evaluating image authenticity and locating forgeries while maintaining human interpretability remains a challenging task. Existing detection models primarily focus on simple authenticity classification, ultimately providing only a forgery probability or binary judgment, which offers limited explanatory insight into image authenticity. Moreover, while MLLM-based detection methods can provide more interpretable results, they still lag behind expert models in pure authenticity-classification accuracy. To address this, we propose DF-LLaVA, a novel and effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first mines latent knowledge from the MLLM itself and then injects it into the model via fine-tuning. During inference, conflict signals arising from the model’s predictions activate a self-reflection process, leading to the final refined responses. This framework allows LLaVA to achieve detection accuracy exceeding that of expert models while still maintaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available online at: https://github.com/Eliot-Shen/DF-LLaVA.
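The inference-time conflict signal described above can be sketched as a simple control loop: when an auxiliary detector and the MLLM's own answer disagree, the model is re-prompted to reconsider. This is a minimal illustration, not the paper's implementation; the function and prompt wording are assumptions.

```python
def detect_with_reflection(image, mllm_answer: str, probe_label: str,
                           ask_mllm) -> str:
    """Hedged sketch of conflict-driven self-reflection.

    `ask_mllm(image, prompt)` is a hypothetical callable wrapping the MLLM;
    the reflection prompt text is illustrative, not taken from the paper.
    """
    if mllm_answer == probe_label:
        # No conflict: accept the MLLM's first response as-is.
        return mllm_answer
    reflection_prompt = (
        f"An auxiliary detector judged this image as '{probe_label}', "
        f"but you answered '{mllm_answer}'. Re-examine the image and "
        "give a final verdict with reasoning."
    )
    # Conflict detected: trigger a second, self-reflective pass.
    return ask_mllm(image, reflection_prompt)

# Usage with a stub MLLM that defers to the reflection hint:
final = detect_with_reflection(
    image=None,
    mllm_answer="real",
    probe_label="synthetic",
    ask_mllm=lambda img, prompt: "synthetic",
)
print(final)  # → "synthetic"
```

The key design point is that the second MLLM call happens only on disagreement, so the common no-conflict path pays no extra inference cost.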


💡 Research Summary

The paper introduces DF‑LLaVA, a novel framework that elevates multimodal large language models (MLLMs) from merely providing binary authenticity judgments to delivering expert‑level detection accuracy together with human‑readable explanations. The authors observe that while the vision encoder of LLaVA‑v1.5 (based on CLIP‑ViT‑L‑14) already contains strong discriminative cues for distinguishing real from synthetically generated images, this knowledge is largely lost when the visual features are passed to the language model. Consequently, LLaVA’s pure classification performance lags behind specialized synthetic‑image detectors, even though it can generate natural‑language rationales.

To unlock this latent capability, DF‑LLaVA employs two complementary mechanisms: Prompt‑Guided Knowledge Injection (PGKI) and Conflict‑Driven Self‑Reflection (CDSR). In PGKI, a lightweight binary classifier (a 10‑dim hidden MLP) is trained on the frozen vision encoder’s
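The PGKI probe described above — a small MLP with a 10-dimensional hidden layer trained on frozen vision-encoder features — can be sketched as follows. This is an assumed reconstruction: the feature dimension 1024 (CLIP-ViT-L/14's projection width), the class names, and the training hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class PGKIProbe(nn.Module):
    """Sketch of a lightweight binary probe on frozen vision features.

    feat_dim=1024 assumes CLIP-ViT-L/14 features; hidden_dim=10 follows
    the "10-dim hidden MLP" mentioned in the summary.
    """
    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # logits: real vs. synthetic
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)

# One illustrative training step: the vision encoder stays frozen,
# so only the probe's parameters receive gradients.
probe = PGKIProbe()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

feats = torch.randn(8, 1024)        # stand-in for frozen encoder features
labels = torch.randint(0, 2, (8,))  # 0 = real, 1 = synthetic
logits = probe(feats)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

Because the probe is tiny and the encoder is frozen, training it is cheap; its predictions can then serve as the conflict signal that drives the self-reflection step at inference time.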

