AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection


Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes whose broad spectral artifacts differ from those of traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder improves the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state of the art in both standard and cross-domain scenarios. We further demonstrate the framework’s versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.


💡 Research Summary

The paper tackles the pressing problem of generalization in deepfake detection, which has become increasingly difficult as image synthesis techniques have evolved from GANs to diffusion models and commercial generators. Traditional detectors trained on narrow GAN‑centric datasets fail to recognize the high‑frequency, noise‑like artifacts produced by modern diffusion models, leading to a “sink label” problem where unseen fakes are misclassified as real. To address this, the authors introduce two complementary contributions.

First, they construct Diff‑Gen, a large‑scale benchmark consisting of 100 k diffusion‑generated images covering the same 20 object categories as LSUN. Unlike the ProGAN‑based datasets that exhibit periodic up‑sampling artifacts, Diff‑Gen captures broad spectral characteristics—especially high‑frequency Gaussian‑like noise—that are typical of diffusion generators such as Stable Diffusion and Midjourney. Experiments show that models pre‑trained on Diff‑Gen achieve markedly better cross‑domain performance than those trained on GAN data.
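The spectral distinction drawn above — periodic up‑sampling peaks in GAN images versus broad, Gaussian‑like high‑frequency energy in diffusion images — is commonly probed with an azimuthally averaged power spectrum. The sketch below illustrates the idea on synthetic data; it is not the paper’s analysis code, and the signal construction is purely illustrative.

```python
import numpy as np

def radial_power_spectrum(img):
    """Azimuthally averaged 2-D power spectrum: a standard forensic tool for
    exposing generator artifacts (illustrative, not the paper's exact code)."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2).astype(int)
    # Average power within each integer radius (frequency) bin.
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

rng = np.random.default_rng(0)
# Double cumulative sum yields a smooth, low-frequency-dominated "natural" image.
natural = rng.standard_normal((64, 64)).cumsum(axis=0).cumsum(axis=1)
# Adding white Gaussian noise mimics the broad high-frequency energy of diffusion fakes.
diffusion_like = natural + 0.5 * rng.standard_normal((64, 64))

spec_nat = radial_power_spectrum(natural)
spec_fake = radial_power_spectrum(diffusion_like)
# The noisy image carries relatively more energy in the highest-frequency bins.
print(spec_fake[-10:].mean() > spec_nat[-10:].mean())
```

In this toy setting, the diffusion‑like image shows elevated power at high radial frequencies, mirroring the artifact family Diff‑Gen is designed to cover.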

Second, the authors propose AdaptPrompt, a parameter‑efficient transfer learning framework built on the frozen CLIP vision‑language model. AdaptPrompt adds two lightweight modules: (1) a visual adapter inserted as a residual bottleneck after the CLIP visual encoder, and (2) learnable continuous text prompts for the “real” and “fake” classes. The visual adapter consists of a down‑projection, a non‑linear activation, and an up‑projection, forcing the network to encode compact representations of generative fingerprints while preserving the bulk of CLIP’s semantic knowledge. Crucially, the authors discover that the final transformer block of CLIP’s vision encoder heavily abstracts away pixel‑level anomalies. By pruning this block (the “v2” variant) and feeding penultimate‑layer features to the adapter, they retain more high‑frequency information, which substantially boosts detection accuracy.
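The residual bottleneck adapter described above can be sketched as follows. This is a minimal PyTorch illustration under assumed dimensions (e.g., a 768‑dim feature from a CLIP ViT vision encoder and a 64‑dim bottleneck); the paper’s exact sizes and activation may differ.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, non-linearity, up-project.
    Dimensions are hypothetical; only these weights would be trained while
    the CLIP backbone stays frozen."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves CLIP's semantic features while the
        # bottleneck encodes a compact representation of generative fingerprints.
        return x + self.up(self.act(self.down(x)))

# In the "v2" variant, x would come from the penultimate transformer block
# (final block pruned), retaining more high-frequency information.
features = torch.randn(4, 768)          # stand-in for CLIP visual features
adapted = VisualAdapter()(features)
print(adapted.shape)                    # torch.Size([4, 768])
```

The design mirrors standard adapter tuning: the output dimensionality matches the input, so the module can be dropped in after the frozen encoder without altering the rest of the pipeline.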

The textual side replaces static hand‑crafted prompts with a sequence of learnable context vectors. These vectors are concatenated with the class token and passed through the frozen CLIP text encoder, producing class embeddings that are aligned with the adapted visual embeddings via cosine similarity. The whole system is trained with a standard cross‑entropy loss and a learnable temperature parameter, updating only the adapter weights and prompt vectors—approximately 0.1 % of CLIP’s total parameters—thus ensuring high training efficiency and low risk of over‑fitting.
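The prompt‑tuning and alignment step can be sketched as below. All sizes are assumptions (8 context vectors, 512‑dim embeddings), and the text encoder is a mean‑pooling stand‑in for the frozen CLIP transformer; only the context vectors and temperature would be trainable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_ctx, embed_dim, batch = 8, 512, 4                        # hypothetical sizes
ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)   # learnable context vectors
class_tokens = torch.randn(2, 1, embed_dim)                # frozen "real"/"fake" token embeddings
logit_scale = nn.Parameter(torch.tensor(4.6))              # learnable temperature (log space)

def text_encoder(prompts: torch.Tensor) -> torch.Tensor:
    # Stand-in for the frozen CLIP text encoder; the real model runs the
    # prompt sequence through frozen transformer layers instead.
    return prompts.mean(dim=1)

# Concatenate shared context with each class token: (2, n_ctx + 1, embed_dim).
prompts = torch.cat([ctx.expand(2, -1, -1), class_tokens], dim=1)
class_embeds = F.normalize(text_encoder(prompts), dim=-1)

image_feats = F.normalize(torch.randn(batch, embed_dim), dim=-1)  # adapted visual embeddings
logits = logit_scale.exp() * image_feats @ class_embeds.t()       # cosine similarity logits
loss = F.cross_entropy(logits, torch.tensor([0, 1, 0, 1]))        # 0 = real, 1 = fake
print(logits.shape)  # torch.Size([4, 2])
```

Backpropagating this loss would update only `ctx`, `logit_scale`, and the adapter weights, which is what keeps the trainable footprint near 0.1 % of CLIP’s parameters.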

Comprehensive evaluation spans 25 diverse test sets, including a wide range of GANs, diffusion models, commercial tools, and standard forensic benchmarks (e.g., FF++). AdaptPrompt trained on Diff‑Gen achieves state‑of‑the‑art average precision and accuracy across all domains, outperforming full fine‑tuning, linear probing, and single‑modality PEFT baselines while using far fewer trainable parameters. Notably, the method excels on the most challenging subsets—commercial generators and recent diffusion models—where it improves performance by 10–15 percentage points relative to the best prior approaches.

Additional experiments demonstrate the framework’s versatility. In few‑shot scenarios, as few as 320 labeled images suffice to reach competitive performance, highlighting the data efficiency of the approach. For source attribution, the model can correctly identify the specific generator architecture among ten candidates with over 90 % accuracy, indicating that the visual adapter captures subtle architectural fingerprints while the textual prompts provide a semantic grounding for “fakeness.”

In summary, the paper makes two key advances: (1) the Diff‑Gen dataset, which enriches training data with diffusion‑style artifacts, and (2) the AdaptPrompt framework, which synergistically combines visual adapters and prompt tuning on a frozen CLIP backbone. Together they close the generalization gap in deepfake detection, delivering a solution that is both highly effective and computationally lightweight. Future work may explore extending AdaptPrompt to other multimodal foundation models, real‑time deployment, and broader forensic tasks such as video deepfake detection.

