FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Benefiting from significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability, or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces highly consistent results in a single inference stage. The method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that extracts comprehensively fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains the reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. An ID-preserving module then injects these fused features and landmark conditions into the UNet. Experimental results demonstrate that our approach performs remarkably well in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.


💡 Research Summary

FaceSnap addresses the growing demand for high‑fidelity, personalized portrait generation by leveraging a single reference image and a pre‑trained Stable Diffusion XL model without any fine‑tuning at inference time. The authors identify two main challenges in this domain: preserving the subject’s identity (including fine‑grained facial details) and providing flexible pose control while keeping the generation pipeline fast and resource‑efficient. Existing solutions either rely on costly fine‑tuning (e.g., LoRA, DreamBooth) that yields high identity fidelity but incurs significant latency, or adopt a single‑stage inference approach that is fast but suffers from insufficient detail preservation.

To bridge this gap, FaceSnap introduces three core components. First, the Facial Attribute Mixer fuses low‑level CLIP image features with high‑level face‑ID embeddings. After linear projection to a common dimension, the CLIP features serve as keys and values while the face‑ID tokens act as queries in a cross‑attention operation, producing preliminary fused features. A learnable set of 16 query tokens then passes through a Transformer decoder to generate the final fused representation (f_mix). This design captures both the texture‑rich information from CLIP and the identity‑centric information from the face recognition model, outperforming simple concatenation or projection schemes.
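The two-stage fusion described above can be sketched in plain NumPy. This is a minimal, illustrative reconstruction, not the authors' implementation: the shared dimension (64), the number of CLIP patch tokens (257), and the number of face-ID tokens (4) are assumptions for demonstration; only the 16 learnable query tokens come from the summary, and a single cross-attention step stands in for the full Transformer decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product attention: (n_q, d), (n_k, d) -> (n_q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 64                                   # shared dim after linear projection (assumed)
clip_feats = rng.normal(size=(257, d))   # low-level CLIP patch features (keys/values)
id_tokens  = rng.normal(size=(4, d))     # high-level face-ID embeddings (queries)
queries    = rng.normal(size=(16, d))    # 16 learnable query tokens (per the paper)

# Stage 1: face-ID tokens attend to CLIP features -> preliminary fused features
prelim = cross_attention(id_tokens, clip_feats, clip_feats)

# Stage 2: learnable queries attend to the preliminary features
# (a single step standing in for the Transformer decoder) -> final f_mix
f_mix = cross_attention(queries, prelim, prelim)
print(f_mix.shape)  # (16, 64)
```

The key design choice is the asymmetry of roles: identity tokens query texture-rich CLIP features, so the fused representation stays anchored to the face-ID embedding rather than to generic image content.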

Second, the Landmark Predictor supplies spatial control. It uses DECA, a 3D Morphable Model (3DMM)-based reconstruction method, to extract shape, pose, and expression parameters from both the reference (source) image and a driving image. The predictor combines the source shape with the driving pose and expression, reconstructs a new 3D face, and projects it onto the 2D plane to obtain 72 facial landmarks. These landmarks preserve the source identity while adopting the desired pose, offering far richer spatial guidance than the 5‑point controls used in many prior works.
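The parameter-mixing step can be illustrated with a toy linear 3DMM. Everything numerical here is synthetic (random bases, 10 shape/expression coefficients, a yaw-only pose); only the 72-landmark count and the source-shape-plus-driving-pose/expression recombination come from the summary. DECA's actual FLAME-based reconstruction is far richer than this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 72                                            # facial landmarks (from the paper)
mean_lmk    = rng.normal(size=(N, 3))             # toy mean 3D landmark positions
shape_basis = rng.normal(size=(N, 3, 10)) * 0.1   # toy identity (shape) basis
expr_basis  = rng.normal(size=(N, 3, 10)) * 0.1   # toy expression basis

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def landmarks_2d(shape, expr, yaw):
    # linear 3DMM: mean + shape/expression offsets, then rigid pose
    verts = mean_lmk + shape_basis @ shape + expr_basis @ expr
    verts = verts @ rot_y(yaw).T
    return verts[:, :2]                           # orthographic projection to 2D

shape_src = rng.normal(size=10)   # identity coefficients from the reference image
expr_drv  = rng.normal(size=10)   # expression coefficients from the driving image
yaw_drv   = 0.4                   # driving head pose (radians)

# key idea: source shape + driving pose/expression -> identity-preserving landmarks
lmk = landmarks_2d(shape_src, expr_drv, yaw_drv)
print(lmk.shape)  # (72, 2)
```

Because identity lives in the shape coefficients and pose/expression are swapped in from the driving image, the projected landmarks keep the source face's proportions while following the target pose.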

Third, the Face Fidelity Reinforce Network (FFRNet) integrates the fused features and landmarks into the diffusion UNet via a ControlNet‑style architecture. Instead of feeding a textual prompt, the network directly conditions the UNet’s cross‑attention layers on f_mix, encouraging the model to focus on identity cues. During training, the backbone diffusion model remains frozen; the loss combines a masked diffusion term (to limit learning to facial regions) and an ID loss based on cosine similarity between face‑recognition embeddings of generated and reference images. The total loss is L_total = L_diff + λ_id·L_id, with λ_id tuned to balance fidelity and diversity.
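The training objective can be sketched directly from the summary's formula L_total = L_diff + λ_id·L_id. The latent resolution, embedding size, and random inputs below are placeholders; the structure (a face-mask-restricted MSE diffusion term plus a cosine-similarity ID term) follows the description above.

```python
import numpy as np

def masked_diffusion_loss(noise_pred, noise_gt, face_mask):
    # MSE restricted to facial regions via a binary spatial mask
    m = face_mask[..., None]                       # broadcast mask over channels
    return np.sum(m * (noise_pred - noise_gt) ** 2) / np.maximum(m.sum(), 1.0)

def id_loss(emb_gen, emb_ref):
    # 1 - cosine similarity between face-recognition embeddings
    cos = emb_gen @ emb_ref / (np.linalg.norm(emb_gen) * np.linalg.norm(emb_ref))
    return 1.0 - cos

rng = np.random.default_rng(2)
noise_pred = rng.normal(size=(32, 32, 4))          # toy predicted noise (latent grid)
noise_gt   = rng.normal(size=(32, 32, 4))          # toy ground-truth noise
face_mask  = (rng.random((32, 32)) > 0.5).astype(float)
emb_gen, emb_ref = rng.normal(size=512), rng.normal(size=512)

lambda_id = 0.5                                    # value reported in the summary
l_total = (masked_diffusion_loss(noise_pred, noise_gt, face_mask)
           + lambda_id * id_loss(emb_gen, emb_ref))
```

Both terms are non-negative by construction, so λ_id directly trades off how hard the optimizer is pushed toward identity fidelity versus plain denoising accuracy.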

The authors train FaceSnap on a curated dataset of 160 k facial images from VGGFace, FFHQ, and CelebA‑HQ, augmenting it to 800 k samples with BLIP‑2 generated captions. Training runs for 360 k steps on two NVIDIA A800 GPUs (batch size 7 per GPU). For inference, they employ the DreamShaperXLv2.1 Turbo model, DPM++ SDE sampler, 8 steps, guidance scale 2, and λ_id = 0.5.
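The inference settings above can be collected into a single configuration; this is just a convenience restatement of the reported hyperparameters, with key names chosen for illustration rather than taken from any released code.

```python
# Hypothetical config dict restating the inference settings from the summary;
# the key names are illustrative, not from the authors' codebase.
inference_config = {
    "base_model": "DreamShaperXLv2.1 Turbo",
    "sampler": "DPM++ SDE",
    "num_steps": 8,
    "guidance_scale": 2,
    "lambda_id": 0.5,
    "resolution": (1024, 1024),
}
print(inference_config["num_steps"])  # 8
```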

Evaluation uses four widely accepted metrics: CLIP‑face (cosine similarity between CLIP embeddings of real and generated images), FaceSim (similarity of face‑ID embeddings), CLIP‑T (text‑image alignment), and FID (image quality). FaceSnap achieves the highest CLIP‑face (81.4) and FaceSim (73.6) scores among competitors, while also delivering the lowest FID (205.6). Its CLIP‑T score is slightly lower, reflecting the model’s emphasis on identity over textual fidelity. In terms of efficiency, inference takes 6.1 seconds and consumes 18.1 GB VRAM for a 1024×1024 image, comparable to other single‑stage methods.
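Two of these metrics, CLIP-face and FaceSim, are both cosine similarities over different embedding spaces, which a short sketch makes concrete. The embedding dimensions (768 for CLIP, 512 for the face-ID model) and the random vectors are stand-ins for real encoder outputs.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
# stand-ins for encoder outputs on a real and a generated portrait
clip_real,  clip_gen   = rng.normal(size=768), rng.normal(size=768)
faceid_real, faceid_gen = rng.normal(size=512), rng.normal(size=512)

clip_face = cosine_sim(clip_real, clip_gen)       # CLIP-face metric
face_sim  = cosine_sim(faceid_real, faceid_gen)   # FaceSim metric
```

The difference between the two scores is entirely in the encoder: CLIP-face measures overall visual resemblance, while FaceSim isolates identity by using a face-recognition embedding.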

Qualitative comparisons across five distinct identities, multiple poses, and varied prompts show that FaceSnap consistently renders realistic skin texture, accurate eye reflections, and precise mouth shapes, while preserving the subject’s unique facial characteristics. A user study with 40 participants confirms a preference for FaceSnap in both realism and identity preservation.

Ablation studies further validate each component. Removing the Facial Attribute Mixer or replacing it with a simple concatenation reduces both identity metrics and increases FID. Excluding FFRNet or the landmark conditions degrades performance, and using the Landmark Predictor instead of raw driving landmarks yields the best identity scores, demonstrating its role in maintaining facial structure across pose changes.

Limitations include the additional preprocessing required for 3DMM reconstruction, which adds latency, and potential challenges with extreme lighting or expression variations not fully covered in the training set. Future work may explore lightweight 3D reconstruction, stronger multimodal alignment losses, and extensions to full‑body or video generation.

In summary, FaceSnap presents a novel, plug‑and‑play framework that delivers high‑fidelity, identity‑preserving portrait generation in a single diffusion pass without any fine‑tuning. By intelligently merging CLIP and face‑ID features and leveraging 72‑point landmark control, it achieves state‑of‑the‑art results while remaining practical for real‑world applications such as virtual avatar creation, personalized marketing content, and digital entertainment.

