Improving Generative Adversarial Network Generalization for Facial Expression Synthesis
Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fréchet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.
💡 Research Summary
The paper tackles the well‑known generalization problem of conditional generative adversarial networks (cGANs) for facial expression synthesis (FES). While state‑of‑the‑art cGANs such as StarGAN, GANimation, and recent attention‑enhanced variants produce impressive results on in‑distribution data, their performance collapses when presented with out‑of‑distribution (OOD) inputs such as celebrity photographs, historical portraits, statues, or avatar renderings. To address this, the authors propose Regression GAN (RegGAN), a two‑stage architecture that combines a localized ridge‑regression layer with a multi‑scale attention refinement network.
Regression Layer (G_E).
The first stage learns a shallow mapping from an input face to an intermediate image that already encodes the target expression. Instead of a global ridge‑regression, which would require O(N²) parameters for an N‑pixel image, the authors adopt a patch‑based formulation: each output pixel is expressed as a linear combination of its r × r neighbourhood. This sparsified ridge regression dramatically reduces the parameter count to r² + 1 per pixel, mitigates over‑fitting, and, crucially, makes the mapping largely independent of the global image statistics (lighting, background, style). The loss consists of a least‑squares reconstruction term plus an L2 regularizer (λ_Reg), and the optimal weights admit a closed‑form solution, enabling fast training. Separate regression models are trained for each source‑target expression pair (e.g., neutral→happy).
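The patch-based formulation and its closed-form solution can be sketched as follows. This is an illustrative sketch only, assuming grayscale images and a single source→target expression pair; the function names, array shapes, and edge padding are assumptions, not the paper's implementation:

```python
import numpy as np

def fit_local_ridge(sources, targets, r=5, lam=1e-2):
    """Fit one ridge regression per output pixel.

    sources, targets: (K, H, W) arrays of K aligned face pairs for one
    source->target expression (e.g. neutral->happy). Each output pixel
    is a linear combination of its r x r input neighbourhood plus a
    bias: r*r + 1 parameters per pixel, instead of N per pixel for a
    global map over all N pixels.
    """
    K, H, W = sources.shape
    p = r // 2
    padded = np.pad(sources, ((0, 0), (p, p), (p, p)), mode="edge")
    weights = np.empty((H, W, r * r + 1))
    for i in range(H):
        for j in range(W):
            # Design matrix: K flattened patches plus a bias column.
            A = padded[:, i:i + r, j:j + r].reshape(K, -1)
            A = np.hstack([A, np.ones((K, 1))])
            b = targets[:, i, j]
            # Closed-form ridge solution: w = (A^T A + lam*I)^-1 A^T b
            d = A.shape[1]
            weights[i, j] = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
    return weights

def apply_local_ridge(image, weights, r=5):
    """Apply the per-pixel models to a new (H, W) face image."""
    H, W = image.shape
    p = r // 2
    padded = np.pad(image, p, mode="edge")
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + r, j:j + r].ravel()
            out[i, j] = patch @ weights[i, j, :-1] + weights[i, j, -1]
    return out
```

Because each pixel's weights depend only on a small local neighbourhood, the mapping is insensitive to global image statistics, which is the property the paper credits for OOD robustness.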
Refinement Network (G_R).
The second stage refines the coarse output of G_E into a photorealistic image. It follows an encoder‑decoder backbone enriched with three types of attention blocks: Encoding Attention Blocks (EAB), Latent Attention Blocks (LAB), and Decoding Attention Blocks (DAB). Each block contains a Feature Unit (convolutional layers with PReLU) and an Attention Unit built on an hourglass network that produces spatial attention maps α. These maps focus computation on expression‑critical regions such as eyes, eyebrows, and mouth, while preserving global facial structure. The network stacks three EABs, ten LABs in the bottleneck, and three DABs, allowing multi‑scale feature aggregation.
Training Procedure.
Training proceeds sequentially. First, G_E is optimized solely with the ridge‑regression loss, yielding an intermediate representation x_E for every training sample. Then G_E is frozen, and G_R together with a discriminator D is trained using a conditional adversarial loss (L_GAN) combined with a pixel‑level L1/L2 loss. The total objective is L = L_Reg + λ·L_GAN, where λ balances expression fidelity (provided by regression) against realism (provided by adversarial learning).
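The combined objective can be sketched as below. The non-saturating adversarial term is a generic stand-in (the paper uses a conditional formulation whose exact form is not given here), and the λ value is an assumption:

```python
import numpy as np

def reg_loss(pred, target):
    """Least-squares reconstruction term from the regression stage."""
    return np.mean((pred - target) ** 2)

def gan_loss(d_fake):
    """Generator-side adversarial term on discriminator scores in (0, 1).
    Generic non-saturating stand-in for the paper's conditional loss."""
    return -np.mean(np.log(d_fake + 1e-8))

def total_loss(pred, target, d_fake, lam=0.1):
    """L = L_Reg + lambda * L_GAN: lambda trades expression fidelity
    (regression term) against realism (adversarial term)."""
    return reg_loss(pred, target) + lam * gan_loss(d_fake)
```

When the generator reconstructs the target perfectly and fully fools the discriminator, both terms vanish; in practice λ controls how far the refinement network may deviate from the regression output in pursuit of realism.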
Experimental Setup.
The model is trained on the Controlled Facial Expression (CFEE) dataset, which contains only a few hundred well‑controlled human faces. Evaluation is performed on: (1) the CFEE test split (in‑distribution), and (2) five OOD collections—CelebA‑HQ, celebrity photos, classical portraits, 3‑D avatar renders, and stone sculptures. Four quantitative metrics are reported:

- Expression Classification Score (ECS): a pretrained expression classifier measures how well the target expression is realized.
- Face Similarity Score (FSS): a face‑recognition model evaluates identity preservation.
- QualiCLIP: a CLIP‑based no‑reference image quality metric used to assess perceptual realism.
- Fréchet Inception Distance (FID): the standard generative‑model metric comparing feature statistics of real and generated images (lower is better).
RegGAN outperforms six recent baselines on ECS, FID, and QualiCLIP, and ranks second on FSS (trailing the best baseline by only ~2 %). Human preference studies show RegGAN is preferred over the strongest competitor by 25 % for expression quality, 26 % for identity preservation, and 30 % for overall realism.
Ablation Findings.
Removing the regression layer and training a vanilla GAN leads to severe expression distortion and identity loss on OOD inputs. Excluding the attention blocks degrades FID and QualiCLIP dramatically, confirming their role in detail recovery. Varying the patch size r reveals a sweet spot at r = 5; smaller patches lack sufficient context, while larger patches re‑introduce over‑parameterization.
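A quick parameter count makes the patch-size trade-off concrete; the 128×128 resolution below is an assumed example, not a figure from the paper:

```python
def local_params(r):
    """Parameters per output pixel for the local ridge regression:
    r*r neighbourhood weights plus one bias."""
    return r * r + 1

# A global linear map would need one weight per input pixel,
# i.e. N parameters per output pixel for an N-pixel image.
N = 128 * 128  # assumed example resolution

for r in (3, 5, 7):
    print(f"r={r}: {local_params(r)} params/pixel vs {N} for a global map")
```

At r = 5 each pixel's model has 26 parameters, over 600× fewer than a global map at this resolution, which is why larger patches quickly re-introduce the over-parameterization the local formulation was meant to avoid.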
Limitations and Future Work.
- Per‑Expression Models: A distinct regression model is required for each source‑target expression pair, which limits scalability to continuous expression intensity or multi‑label scenarios.
- Extreme Pose / Occlusion: The local patch assumption struggles with large pose variations or heavy occlusions (e.g., masks, glasses), where the receptive field may not capture the necessary context.
- Temporal Consistency: The paper focuses on still images; extending to video would require additional constraints to avoid flicker.
Future directions include learning a unified regression that handles continuous expression vectors, integrating 3‑D face priors to better cope with pose, and combining diffusion‑based denoising to further enhance fine‑grained texture.
Impact.
RegGAN demonstrates that coupling a lightweight, domain‑agnostic regression front‑end with a powerful attention‑driven GAN back‑end yields a system that generalizes far beyond its training distribution while maintaining high visual fidelity. This opens up practical applications in avatar animation, heritage restoration, virtual reality, and any scenario where limited labeled facial data must be leveraged to synthesize expressive faces across diverse visual domains.