ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

Notice: This research summary and analysis were generated automatically with AI. For authoritative details, please refer to the original arXiv paper.

The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark covering all domains of FIDL is still missing. This absence results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the FIDL field as a whole. To break down these domain silos, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering the drastic variations in dataset, model, and evaluation configurations across domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular, configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, and 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks, DeepfakeBench and IMDLBenCo, through an adapter-based design; iii) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.


💡 Research Summary

The paper addresses a critical fragmentation problem in the field of Fake Image Detection and Localization (FIDL), which currently consists of four largely independent sub‑domains: Deepfake detection, Image Manipulation Detection and Localization (IMDL), AI‑Generated Content (AIGC) detection, and Document Image Manipulation Localization (Doc). Each sub‑domain has its own datasets, model families, and evaluation protocols, leading to “domain silos” that impede cross‑task comparison, reproducibility, and the development of generalizable forensic solutions.

To overcome these barriers, the authors introduce ForensicHub, the first unified benchmark and codebase that spans all four FIDL domains. The core contribution is a modular, configuration‑driven architecture that decomposes a forensic pipeline into four interchangeable components:

  1. Datasets – unified loading interface returning a standard dictionary (image, label, mask).
  2. Transforms – task‑agnostic preprocessing and augmentation pipelines.
  3. Models – any detection or segmentation model that conforms to the unified output format (image‑level classification and/or pixel‑level mask).
  4. Evaluators – a suite of GPU‑accelerated image‑level and pixel‑level metrics (AP, MCC, TNR, TPR, AUC, ACC, F1, IoU, etc.).
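The component contract described above can be sketched in PyTorch. This is a hypothetical illustration of the unified sample dictionary and model output format, not ForensicHub's actual class names or signatures:

```python
import torch
import torch.nn as nn


class ForensicModel(nn.Module):
    """Hypothetical sketch of the unified model contract: any detector or
    localizer consumes an image tensor and returns a dict with an image-level
    probability ("prob") and/or a pixel-level mask ("mask")."""

    def forward(self, image: torch.Tensor) -> dict:
        raise NotImplementedError


class DummyDetector(ForensicModel):
    """Minimal image-level detector conforming to the sketched contract."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(8, 1)

    def forward(self, image: torch.Tensor) -> dict:
        feat = self.backbone(image).flatten(1)          # (B, 8)
        logit = self.head(feat).squeeze(1)              # (B,)
        return {"prob": torch.sigmoid(logit), "mask": None}


# A unified sample follows the (image, label, mask) dictionary convention.
batch = {"image": torch.rand(2, 3, 64, 64),
         "label": torch.tensor([0, 1]),
         "mask": None}
out = DummyDetector()(batch["image"])
```

Because every model emits the same dictionary, evaluators can consume outputs from any domain without model-specific glue code.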

The system is driven entirely by YAML configuration files, allowing users to assemble training or testing pipelines without writing code. A code‑generator is also provided for custom extensions. Crucially, ForensicHub adopts an adapter‑based design to integrate two widely used existing benchmarks—DeepfakeBench and IMDLBenCo—so that their datasets, models, and evaluation scripts can be reused unchanged.
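A pipeline assembled this way might look like the following YAML fragment. The field names here are illustrative assumptions, not ForensicHub's exact schema:

```yaml
# Hypothetical ForensicHub-style pipeline config; keys are illustrative.
dataset:
  name: CASIA          # any registered dataset returning (image, label, mask)
  transform: default_imdl
model:
  name: Xception
  pretrained: true
evaluator:
  metrics: [ACC, F1, IoU]
  threshold: 0.5       # fixed binarization threshold for pixel metrics
train:
  epochs: 30
  batch_size: 32
```

Swapping a Deepfake dataset and model into the same skeleton is what makes cross-domain composition a configuration change rather than a code change.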

Implemented resources:

  • 10 baseline models (including three re‑implemented from scratch) covering detection and localization across domains.
  • 6 backbone networks (ResNet, Xception, EfficientNet, SegFormer, Swin‑Transformer, ConvNeXt).
  • Datasets spanning all four domains: 6 Deepfake (FaceForensics++, Celeb‑DF, DFDC, etc.), 6 IMDL (CASIA, COVERAGE, Columbia, etc.), 2 AIGC (DiffusionForensics, GenImage), and 5 Document (Doctamper, T‑SROIE, OSTF, etc.).
  • 42 model‑dataset combinations and 16 cross‑domain evaluation scenarios.
  • 11 GPU‑accelerated metrics covering both image‑level and pixel‑level performance.
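A pixel-level metric of the kind listed above can be computed entirely on-device with tensor operations, which is what makes GPU acceleration straightforward. This is a minimal sketch of a thresholded pixel F1, not ForensicHub's actual implementation:

```python
import torch


def pixel_f1(pred: torch.Tensor, target: torch.Tensor,
             threshold: float = 0.5) -> torch.Tensor:
    """Pixel-level F1 computed on whichever device the tensors live on
    (CPU or GPU), with a fixed binarization threshold.

    Illustrative sketch only; the benchmark's real metric code may differ.
    """
    pred_bin = (pred >= threshold).float()
    tp = (pred_bin * target).sum()
    fp = (pred_bin * (1 - target)).sum()
    fn = ((1 - pred_bin) * target).sum()
    return 2 * tp / (2 * tp + fp + fn + 1e-8)  # epsilon avoids 0/0


# A perfect prediction yields F1 close to 1.
mask = torch.tensor([[0., 1.], [1., 0.]])
score = pixel_f1(mask, mask)
```

Because the computation is pure tensor arithmetic, moving `pred` and `target` to a GPU (`.cuda()`) accelerates it with no code changes.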

Beyond compatibility, ForensicHub introduces two new benchmark protocols for the previously under‑represented AIGC and Document domains. For AIGC, the protocol evaluates generalization across diffusion‑based and multi‑model generated images using DiffusionForensics and the large‑scale GenImage dataset. For Document, the protocol includes both automatically and manually annotated forgeries, covering receipts, certificates, and ID cards, thereby testing robustness to real‑world document tampering. Both protocols emphasize cross‑domain generalization by allowing models trained on one dataset to be tested on another.

The authors conduct an extensive empirical study, yielding eight actionable insights:

  1. Multi‑scale and frequency‑aware architectures (e.g., HRNet, Xception with DCT or SRM streams) consistently outperform pure RGB‑only models across domains.
  2. Vision‑language backbones (e.g., CLIP‑ViT) provide strong generalization for AIGC detection, likely due to semantic grounding.
  3. Manual mask annotations improve localization accuracy compared to purely synthetic masks, highlighting the value of high‑quality ground truth.
  4. Joint image‑level and pixel‑level evaluation is essential; models that excel in classification may still produce poor masks, and vice versa.
  5. Backbone selection matters: Xception excels in deepfake detection, while Swin‑Transformer and ConvNeXt are more competitive for high‑resolution document localization.
  6. Cross‑domain transfer learning is feasible: several DeepfakeBench models retain performance when fine‑tuned on IMDL or AIGC tasks, suggesting shared low‑level forensic cues.
  7. Metric standardization (fixed 0.5 threshold, GPU‑accelerated computation) dramatically reduces variance in reported results and improves reproducibility.
  8. A unified benchmark accelerates research: by providing a single codebase, the time to set up experiments drops from weeks to hours, fostering rapid iteration and fair comparison.

Overall, ForensicHub represents a significant step toward breaking the siloed nature of FIDL research. By offering a flexible, extensible, and fully open‑source platform, it enables researchers to develop and evaluate forensic models that are robust across manipulation types, datasets, and real‑world scenarios. The authors release all code, configurations, and processed data at https://github.com/scu-zjz/ForensicHub, inviting the community to build upon this foundation.

