EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
Recent photo-realistic 3D talking heads based on 3D Gaussian Splatting still fall short in emotional expression manipulation, especially for fine-grained and expansive dynamic emotion editing under multi-modal control. This paper introduces a new editable 3D Gaussian talking head, EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which comprises an action unit (AU)-prompted Gaussian diffusion process serving as a fine-grained facial animator, together with a text-to-AU emotion controller that enables accurate and expansive dynamic emotion editing from text input. Experiments on the public EmoTalk3D and RenderMe-360 datasets demonstrate the superior emotional subtlety, lip-sync fidelity, and controllability of EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multi-modal editable 3D talking-head synthesis. To the best of our knowledge, EmoDiffTalk is among the first 3D Gaussian Splatting talking-head generation frameworks to support continuous, multi-modal emotional editing within an AU-based expression space.
💡 Research Summary
The paper “EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head” addresses a significant limitation in current state-of-the-art 3D talking head generation using 3D Gaussian Splatting (3DGS): the lack of fine-grained, expansive, and controllable emotional expression editing via multi-modal inputs like text.
The authors propose a novel framework named EmoDiffTalk, whose core innovation is an “Emotion-aware Gaussian Diffusion” process. This process moves away from implicit, holistic emotion control and instead grounds the generation in an anatomically meaningful intermediate representation: Action Units (AUs). AUs correspond to specific facial muscle movements, providing a principled and interpretable basis for expression modeling.
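To make the AU representation concrete, an expression can be encoded as a dense vector of per-AU intensities. The AU numbers below follow the standard FACS convention (e.g., AU6 cheek raiser, AU12 lip corner puller), but the vector length, intensity scale, and helper function are illustrative assumptions, not the paper's exact encoding:

```python
import numpy as np

# A few well-known FACS Action Units (AU numbering is standard FACS;
# the subset, vector length, and [0, 1] intensity scale are illustrative).
AU_NAMES = {1: "inner brow raiser", 4: "brow lowerer",
            6: "cheek raiser", 12: "lip corner puller", 25: "lips part"}

def expression_vector(active, num_aus=32):
    """Encode an expression as a dense AU-intensity vector in [0, 1]."""
    v = np.zeros(num_aus, dtype=np.float32)
    for au, intensity in active.items():
        v[au - 1] = intensity  # AU numbering is 1-based
    return v

# A smile: cheek raiser (AU6) + lip corner puller (AU12).
smile = expression_vector({6: 0.7, 12: 0.9})
print(smile[5], smile[11])  # approximately 0.7 and 0.9
```

Because each dimension maps to a specific facial muscle movement, downstream components can enhance or suppress individual AUs without disturbing the rest of the expression.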
The EmoDiffTalk pipeline consists of two main stages. First, a Canonical Gaussian Rig is reconstructed from multi-view images of a subject. This rig separates facial from non-facial regions and employs tri-plane-based color prediction to support accurate attribute decoding in later stages.
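The tri-plane idea is to project each 3D point onto three axis-aligned feature planes and aggregate the sampled features. A minimal sketch of such a lookup, using nearest-neighbour sampling and summation as the aggregation (the paper's exact resolution, interpolation, and decoder details may differ):

```python
import numpy as np

def triplane_features(points, planes, extent=1.0):
    """Sample per-point features from three axis-aligned feature planes.

    points: (N, 3) coordinates in [-extent, extent]^3
    planes: dict of (R, R, C) feature grids keyed "xy", "xz", "yz"
    Features from the three projections are summed (a common tri-plane
    aggregation; bilinear sampling would be used in practice).
    """
    def sample(grid, u, v):
        R = grid.shape[0]
        # Map [-extent, extent] to grid indices (nearest neighbour for brevity)
        i = np.clip(((u / extent + 1) / 2 * (R - 1)).round().astype(int), 0, R - 1)
        j = np.clip(((v / extent + 1) / 2 * (R - 1)).round().astype(int), 0, R - 1)
        return grid[i, j]

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return (sample(planes["xy"], x, y)
            + sample(planes["xz"], x, z)
            + sample(planes["yz"], y, z))

rng = np.random.default_rng(0)
planes = {k: rng.normal(size=(64, 64, 16)) for k in ("xy", "xz", "yz")}
pts = rng.uniform(-1, 1, size=(100, 3))
feats = triplane_features(pts, planes)
print(feats.shape)  # (100, 16)
```

The sampled feature vector would then be decoded (e.g., by a small MLP) into per-Gaussian color, which is what makes the rig's appearance attributes editable downstream.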
The second and central stage is the Emotion-aware Gaussian Diffusion, which animates the canonical rig. It comprises two key components:
- AU-prompt Gaussian Diffusion: A transformer encoder first maps speech features (HuBERT) to a sequence of AU codes. These AU codes then act as precise conditioning prompts for a diffusion model, which predicts temporal offsets for the positions of facial Gaussians. This establishes a direct, fine-grained link between speech-driven AUs and geometric facial motion. Following this, lightweight decoders (RotNet and OPCNet) decode other Gaussian attributes like rotation and opacity, with the OPCNet utilizing a novel “Feature Line” to capture AU-specific appearance variations.
- Text-to-AU Emotion Controller: To enable intuitive text-based editing, a separate controller maps a textual emotion prompt (e.g., “the person is smiling”) to a binary activation vector specifying which AUs should be enhanced or suppressed. This vector applies a simple yet effective enhancement-suppression transformation to the original speech-driven AU codes, producing “Emotional AU Codes.” These modified codes are fed into the diffusion process, allowing the system to inject the desired emotion while preserving the accurate lip-sync derived from the original speech.
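The enhancement-suppression step described above can be sketched as a simple transformation of the speech-driven AU codes. The linear push toward 1 (enhance) or 0 (suppress) below is an assumption for illustration; the paper's exact transformation may differ. Note that AUs outside both masks pass through unchanged, which is what preserves the speech-derived lip sync:

```python
import numpy as np

def apply_emotion(au_codes, enhance_mask, suppress_mask, strength=0.5):
    """Enhance/suppress speech-driven AU codes per binary activation vectors.

    au_codes:      (T, K) speech-driven AU intensities in [0, 1]
    enhance_mask:  (K,) binary, AUs the text prompt should strengthen
    suppress_mask: (K,) binary, AUs the text prompt should damp
    The linear interpolation toward 1 / 0 used here is an assumed form
    of the enhancement-suppression transformation, not the paper's exact one.
    """
    out = au_codes.copy()
    out += strength * enhance_mask * (1.0 - out)  # push enhanced AUs up
    out -= strength * suppress_mask * out         # pull suppressed AUs down
    return np.clip(out, 0.0, 1.0)

# "the person is smiling" -> controller enhances AU6/AU12, suppresses AU4.
K = 16
codes = np.full((4, K), 0.3)                 # 4 frames of speech-driven codes
enhance = np.zeros(K); enhance[[5, 11]] = 1  # AU6, AU12 (0-based indices)
suppress = np.zeros(K); suppress[3] = 1      # AU4
emo = apply_emotion(codes, enhance, suppress)
print(emo[0, 11], emo[0, 3])  # enhanced ≈ 0.65, suppressed ≈ 0.15
```

The resulting "Emotional AU Codes" replace the original codes as the conditioning prompt for the Gaussian diffusion process.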
Extensive experiments on the EmoTalk3D and RenderMe-360 datasets demonstrate EmoDiffTalk’s superiority over previous state-of-the-art methods including EAMM, SadTalker, Real3D-Portrait, EmoTalk3D, Hallo3, and EchoMimic. It achieves significant improvements in objective metrics such as PSNR (image quality), CPBD (sharpness), and LMD (lip-sync accuracy). User studies also confirm its advantages in perceived video fidelity, image quality, and, crucially, the accuracy and naturalness of emotion control. EmoDiffTalk establishes a new pathway for high-quality, diffusion-driven 3D talking head synthesis that seamlessly integrates photo-realistic rendering with precise, multi-modal emotional editing.
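Of the reported metrics, LMD (landmark distance) is the one most directly tied to lip-sync accuracy: it averages the Euclidean distance between predicted and ground-truth facial landmarks over all frames. A minimal sketch (the specific landmark set and any normalization used in the paper's evaluation may differ):

```python
import numpy as np

def landmark_distance(pred, gt):
    """Mean Landmark Distance (LMD): average Euclidean distance between
    predicted and ground-truth landmarks over all frames.

    pred, gt: (T, L, 2) landmark trajectories in pixels.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# A constant (3, 4)-pixel offset on every landmark gives an LMD of 5.
gt = np.zeros((10, 68, 2))
pred = gt + np.array([3.0, 4.0])
print(landmark_distance(pred, gt))  # 5.0
```

Lower LMD means the rendered mouth and face shapes track the ground-truth video more closely, complementing the image-quality metrics PSNR and CPBD.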