AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Blind face restoration (BFR) is a fundamental and challenging problem in computer vision. To faithfully restore high-quality (HQ) photos from low-quality ones, recent research predominantly relies on facial image priors from powerful pretrained text-to-image (T2I) diffusion models. However, such priors often lead to the incorrect generation of non-facial features and insufficient facial detail, rendering them less practical for real-world applications. In this paper, we propose a novel framework, AuthFace, that achieves highly authentic face restoration results by exploring a face-oriented generative diffusion prior. To learn such a prior, we first collect a dataset of 1.5K high-quality images, with resolutions exceeding 8K, captured by professional photographers. Based on this dataset, we then introduce a novel face-oriented restoration-tuning pipeline that fine-tunes a pretrained T2I model. Guided by the key criteria of quality-first and photography-guided annotation, we have photographers direct the retouching and review of the images so that they exhibit rich facial features; the photography-guided annotation system then fully exploits the potential of these high-quality photographic images. In this way, the potent natural image priors of pretrained T2I diffusion models can be subtly harnessed, specifically enhancing their capability in facial detail restoration. Moreover, to minimize artifacts in critical facial areas, such as the eyes and mouth, we propose a time-aware latent facial feature loss to learn the authentic face restoration process. Extensive experiments on synthetic and real-world BFR datasets demonstrate the superiority of our approach.


💡 Research Summary

AuthFace tackles the persistent shortcomings of using large‑scale text‑to‑image (T2I) diffusion models for blind face restoration (BFR). While models such as StableDiffusion‑XL (SDXL) provide powerful generative priors, they are trained for general image synthesis and consequently produce overly smooth skin, miss fine facial textures, and sometimes insert non‑facial artifacts when applied to severely degraded portraits. To address these issues, the authors propose a two‑stage framework that first creates a face‑oriented diffusion prior and then leverages it for authentic restoration.

In Stage I, a curated dataset of 1,500 ultra‑high‑resolution (≥8K) portrait photographs taken by professional photographers is assembled. Each image is annotated not only with conventional semantic tags (gender, clothing, background) but also with “photography‑guided” descriptors that capture stylistic nuances such as lighting direction, skin texture, makeup, and expression. These rich prompts are generated automatically using a vision‑language model (Qwen2.5‑VL‑7B‑Instruct) and then refined by the photographers. The authors fine‑tune the pretrained SDXL model on this dataset, preserving the original diffusion training objectives while injecting the detailed facial prompts. This process converts the generic T2I model into a face‑oriented generative prior that can faithfully reproduce high‑frequency facial details.
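The fine-tuning in Stage I keeps the standard diffusion training objective, only swapping in the curated portraits and photography-guided prompts. A minimal numpy sketch of that objective (ε-prediction MSE on noised latents; shapes, schedule value, and function names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion step: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def denoising_loss(pred_eps, eps):
    """Epsilon-prediction MSE, the usual fine-tuning objective for T2I diffusion."""
    return float(np.mean((pred_eps - eps) ** 2))

# Toy latent of one HQ portrait (SDXL latents are 4-channel; spatial size is arbitrary here).
z0 = rng.normal(size=(4, 64, 64))
eps = rng.normal(size=z0.shape)
zt = add_noise(z0, eps, alpha_bar_t=0.5)  # mid-schedule noise level (assumed value)

# A perfect noise predictor would drive the loss to zero.
loss = denoising_loss(eps, eps)
```

In practice the noise prediction would come from the SDXL UNet conditioned on the photography-guided prompt; the point is that only the data and prompts change, not the objective.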

Stage II introduces a ControlNet adapter that conditions the diffusion process on the degraded input image. Rather than relying solely on an L2 (MSE) loss, which tends to blur critical regions, the authors design a “time‑aware latent facial feature loss.” This loss operates on the latent representations at specific diffusion timesteps, explicitly emphasizing regions where human perception is most sensitive—eyes, mouth, and skin pores. By freezing the UNet weights learned in Stage I and training only the ControlNet, the facial prior remains intact while the model learns to map low‑quality inputs to high‑quality outputs.
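The summary does not give the exact formulation of the time-aware latent facial feature loss, but its two ingredients (a timestep-dependent weight and an emphasis on perceptually critical regions) can be sketched as a mask-weighted latent MSE. Everything below, including the weighting scheme `w_t = 1 - t/T` and the parameter `lam`, is a hypothetical illustration rather than the paper's definition:

```python
import numpy as np

def time_aware_facial_loss(z_pred, z_target, face_mask, t, T=1000, lam=2.0):
    """Hypothetical sketch of a time-aware latent facial feature loss.

    z_pred, z_target : predicted and reference latents (same shape)
    face_mask        : 1 inside critical regions (eyes, mouth), 0 elsewhere
    t, T             : current diffusion timestep and schedule length
    lam              : extra weight on the masked facial regions (assumed)
    """
    w_t = 1.0 - t / T               # emphasize late, low-noise steps where detail emerges
    region = 1.0 + lam * face_mask  # upweight errors in eyes/mouth areas
    return float(w_t * np.mean(region * (z_pred - z_target) ** 2))

# Toy usage: an error inside the masked region costs more than the same error outside it.
z_pred = np.ones((4, 8, 8))
z_ref = np.zeros_like(z_pred)
mask = np.zeros_like(z_pred)
mask[:, :2, :2] = 1.0               # pretend the top-left patch covers the eyes

l_plain = time_aware_facial_loss(z_pred, z_ref, np.zeros_like(mask), t=500)
l_masked = time_aware_facial_loss(z_pred, z_ref, mask, t=500)
```

Since the Stage I UNet is frozen, only the ControlNet receives gradients from this loss, which is what keeps the learned facial prior intact.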

Extensive quantitative and qualitative evaluations are conducted on both synthetic degradation benchmarks and real‑world low‑resolution portrait datasets. AuthFace consistently outperforms state‑of‑the‑art BFR methods such as CodeFormer, DiffBIR, and BFRffusion across PSNR, SSIM, LPIPS, and identity preservation (ID) metrics. Visual comparisons highlight markedly reduced artifacts around the eyes and mouth, as well as more realistic skin granularity. User studies further confirm that participants perceive AuthFace's results as more natural and authentic.

Key insights from the work include: (1) a small, high‑quality, photography‑curated dataset can be more effective than massive low‑quality collections for fine‑tuning diffusion priors; (2) incorporating photography‑guided prompts dramatically improves a T2I model’s ability to generate fine facial details; (3) a time‑aware latent loss is crucial for preserving perceptually important facial regions during restoration. The paper opens avenues for extending the approach to broader demographic diversity, real‑time applications via model compression, and other low‑level vision tasks such as low‑light enhancement or video frame restoration.

