Speech Driven Talking Face Generation from a Single Image and an Emotion Condition
Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input and renders a talking face video that is synchronized with the speech and expresses the conditioned emotion. Objective evaluation of image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a human emotion recognition pilot study using generated videos with mismatched emotions between the audio and visual modalities. Results show that humans rely more strongly on the visual modality than on the audio modality in this task.
💡 Research Summary
The paper addresses the problem of generating realistic talking‑face videos that not only synchronize with a given speech signal but also convey a user‑specified emotional expression. Unlike prior works that either infer emotion from the audio or limit expression to the lip region, the authors propose an end‑to‑end generative adversarial network (GAN) that takes three inputs: (1) a raw speech waveform, (2) a single reference face image, and (3) a categorical emotion label (six basic emotions: anger, disgust, fear, happiness, neutral, sadness). The system produces a 25‑fps video where the facial movements are aligned with the speech and the overall facial expression follows the conditioned emotion.
The architecture consists of a generator and two discriminators. The generator is built from five sub‑modules: a speech encoder (five 1‑D convolutions followed by two LSTM layers), an image encoder (six 2‑D convolutions with U‑Net skip connections), an emotion encoder (two fully‑connected layers that map a one‑hot label to an embedding replicated across time), a noise encoder (a single‑layer LSTM that processes per‑frame Gaussian noise to model head motions unrelated to speech or emotion), and a video decoder (U‑Net‑style decoder that concatenates all embeddings and progressively upsamples to full‑resolution frames).
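The conditioning mechanism described above can be illustrated with a minimal numpy sketch. All dimensions, weights, and layer shapes below are hypothetical placeholders chosen for illustration; the paper does not publish them in this summary. The sketch shows the key idea: the one-hot emotion label is mapped to a dense embedding, replicated across time, and concatenated with the per-frame speech embedding before decoding.

```python
import numpy as np

# Hypothetical dimensions -- placeholders, not the paper's actual sizes.
T = 5            # number of video frames (25 fps)
D_SPEECH = 256   # per-frame speech embedding size
D_EMO = 128      # emotion embedding size
N_EMOTIONS = 6   # anger, disgust, fear, happiness, neutral, sadness

rng = np.random.default_rng(0)

# Speech encoder output: one embedding per video frame.
speech_emb = rng.standard_normal((T, D_SPEECH))

# Emotion encoder: one-hot label -> dense embedding via two FC layers
# (weights are random placeholders, standing in for learned parameters).
one_hot = np.zeros(N_EMOTIONS)
one_hot[3] = 1.0  # condition on "happiness"
W1 = rng.standard_normal((N_EMOTIONS, 64))
W2 = rng.standard_normal((64, D_EMO))
emo_emb = np.maximum(one_hot @ W1, 0) @ W2  # shape (D_EMO,)

# Replicate the emotion embedding across time and concatenate with the
# speech embedding, giving the decoder a joint per-frame conditioning vector.
emo_seq = np.tile(emo_emb, (T, 1))                      # (T, D_EMO)
joint = np.concatenate([speech_emb, emo_seq], axis=1)   # (T, D_SPEECH + D_EMO)
print(joint.shape)  # (5, 384)
```

Replicating a single emotion embedding across all frames is what lets the label control the expression of the whole clip while the speech embedding varies frame by frame.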
Two discriminators are employed during training: a frame discriminator that distinguishes real from generated frames (ensuring high visual fidelity and identity consistency) and an emotion discriminator that classifies the emotion expressed in each generated frame (providing a supervised “emotion discriminative loss”). In addition to a reconstruction loss weighted by a mouth region mask (MRM) and a perceptual loss, the overall objective combines the WGAN‑GP adversarial loss from the frame discriminator with the cross‑entropy loss from the emotion discriminator.
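The combined objective can be sketched in toy form. The loss weights and the mask values below are illustrative assumptions, not the paper's actual hyperparameters; the point is how the three terms (adversarial, emotion classification, mouth-weighted reconstruction) add up.

```python
import numpy as np

def wgan_gp_g_loss(d_fake):
    # Generator side of the WGAN-GP objective: maximize critic scores
    # on generated frames, i.e. minimize their negative mean.
    return -np.mean(d_fake)

def cross_entropy(probs, label):
    # Emotion discriminative loss: negative log-probability of the
    # conditioned emotion class under the emotion discriminator.
    return -np.log(probs[label] + 1e-12)

def masked_l1(fake, real, mask):
    # Reconstruction loss with the mouth region up-weighted by the mask.
    return np.sum(mask * np.abs(fake - real)) / np.sum(mask)

rng = np.random.default_rng(1)
fake_frame = rng.random((64, 64))
real_frame = rng.random((64, 64))
mouth_mask = np.ones((64, 64))
mouth_mask[40:, 16:48] = 3.0  # up-weight a hypothetical mouth region

d_fake_scores = rng.standard_normal(8)           # critic scores on a batch
emo_probs = np.array([0.05, 0.05, 0.05, 0.7, 0.1, 0.05])

# Illustrative loss weights (chosen here for the sketch, not from the paper).
total = (wgan_gp_g_loss(d_fake_scores)
         + 1.0 * cross_entropy(emo_probs, 3)
         + 100.0 * masked_l1(fake_frame, real_frame, mouth_mask))
```

Up-weighting the mouth region in the reconstruction term pushes the generator to prioritize lip accuracy, while the emotion term shapes the rest of the face.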
The model is trained on a dataset annotated with the six basic emotions, using 8 kHz speech and 25 fps video. Objective metrics include PSNR/SSIM for image quality, Lip‑Sync Error for audio‑visual alignment, and emotion classification accuracy for the emotion discriminator. Across all metrics the proposed method outperforms state‑of‑the‑art baselines such as the temporal‑GAN approaches of Vougioukas et al. and the robust talking‑face system of Eskimez et al.
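Of the objective metrics above, PSNR is the simplest to state precisely: it is the peak signal power over the mean squared error between a reference and a generated frame, in decibels. A minimal implementation:

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    # Peak signal-to-noise ratio between a reference and a generated frame.
    # Higher is better; identical frames give infinity.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a frame off by a constant 16 gray levels everywhere.
frame_a = np.zeros((64, 64))
frame_b = np.full((64, 64), 16.0)
print(round(psnr(frame_a, frame_b), 2))  # ≈ 24.05 dB
```

SSIM is more involved (local luminance, contrast, and structure comparisons over sliding windows); in practice both metrics are typically computed with a library such as scikit-image.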
Subjective evaluation is conducted via Amazon Mechanical Turk. Participants rate videos on “realness” and “emotion expression”. The proposed system receives higher scores than the baseline, confirming that the added emotion conditioning improves perceived naturalness and emotional clarity. Moreover, a pilot study presents videos where the audio and visual modalities convey mismatched emotions. Results show that participants rely more heavily on visual cues when judging emotion, corroborating prior findings that visual information dominates multimodal emotion perception.
Key contributions are: (1) introducing the first end‑to‑end talking‑face generator that accepts an explicit emotion condition, thereby decoupling speech content from visual affect; (2) designing an emotion‑discriminative loss that forces the generator to produce recognizable emotional expressions; (3) providing empirical evidence, through a human perception study, that visual emotion cues outweigh auditory cues in incongruent multimodal settings.
Limitations include reliance on categorical emotion labels, which restricts the model to six discrete states and prevents smooth interpolation across the arousal‑valence space. The system is also evaluated on a single speaker per identity, so generalization to diverse head poses, lighting conditions, or multiple speakers remains an open question. Future work could extend the conditioning to continuous emotion embeddings, incorporate more varied facial dynamics, and explore cross‑speaker training to broaden applicability.