FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint
We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals for facial expressions, head movements, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We use a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. These latents implicitly capture nuanced facial expression dynamics while disentangling identity and pose information, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
💡 Research Summary
FactorPortrait introduces a video diffusion framework that enables controllable portrait animation from a single reference image, a driving video, and an arbitrary camera trajectory. The core contribution lies in fully disentangling three control dimensions—facial expression, head pose, and camera viewpoint—and injecting them into a pre‑trained video diffusion transformer (Wan‑DiT) with minimal additional parameters.
Expression control is achieved by a pre‑trained expression encoder that extracts 128‑dim latent codes from each frame of the driving video. These codes capture fine‑grained dynamics such as micro‑expressions, wrinkles, and tongue movement while being largely identity‑agnostic. An “Expression Controller” maps each latent to scale and shift parameters of Adaptive Layer Normalization (AdaLN) inside the DiT blocks, allowing the diffusion process to be modulated per‑frame without altering the backbone weights.
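The per-frame AdaLN modulation described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the single linear projection, hidden width, and initialization are assumptions; only the 128-dim latent and the scale/shift AdaLN form come from the summary.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize features over the channel axis (no learned affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class ExpressionController:
    """Maps a per-frame expression latent to AdaLN scale/shift parameters.

    The 128-dim latent follows the summary; the single linear layer and
    hidden width are illustrative stand-ins for the real module.
    """
    def __init__(self, expr_dim=128, hidden_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        # One projection producing both scale and shift, zero-initialized
        # bias so a zero latent reduces to plain layer norm.
        self.W = rng.standard_normal((expr_dim, 2 * hidden_dim)) * 0.02
        self.b = np.zeros(2 * hidden_dim)

    def __call__(self, expr_latent, features):
        # expr_latent: (expr_dim,), features: (tokens, hidden_dim)
        params = expr_latent @ self.W + self.b
        scale, shift = np.split(params, 2)
        # AdaLN: normalize, then modulate with the predicted affine params.
        return layer_norm(features) * (1.0 + scale) + shift
```

Because only the scale/shift parameters depend on the driving frame, the frozen DiT backbone weights are untouched, which matches the summary's claim of per-frame modulation with minimal added parameters.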
Pose control uses a feed‑forward 3D human mesh estimator to obtain a parametric body mesh for every driving frame. The meshes are rendered into dense normal maps, which provide pixel‑aligned spatial cues of head orientation and subtle body movement. Normal maps preserve spatial correspondence better than Euler angles or rotation matrices, facilitating precise pose conditioning within the diffusion model.
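For intuition on why normal maps give pixel-aligned pose cues, here is the standard encoding that turns a rendered normal buffer into an image the diffusion model can consume. The mesh estimator and renderer themselves are not shown; this is only the well-known unit-normal-to-RGB mapping, not a detail confirmed by the paper.

```python
import numpy as np

def encode_normal_map(normals):
    """Encode per-pixel unit normals (H, W, 3) in [-1, 1] as RGB in [0, 1].

    Each pixel's color directly encodes local surface orientation, so a
    head rotation changes the map smoothly and pixel-aligned -- unlike a
    single Euler-angle or rotation-matrix summary of the pose.
    """
    norms = np.linalg.norm(normals, axis=-1, keepdims=True)
    n = normals / np.clip(norms, 1e-8, None)  # guard degenerate pixels
    return 0.5 * (n + 1.0)
```

A camera-facing surface (normal `(0, 0, 1)`) maps to the color `(0.5, 0.5, 1.0)`, the familiar blue tint of normal maps.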
Viewpoint control adopts Plücker ray maps. For each frame, the relative camera pose between the driving view and the reference view is encoded as a ray map that stores both direction and origin for every pixel. This continuous representation enables smooth camera trajectories and novel view synthesis without discretization artifacts.
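A Plücker ray map stores, per pixel, the ray's direction and its moment (origin crossed with direction). A minimal NumPy construction is sketched below; the pinhole conventions (world-to-camera `R`, `t`, pixel-center offset) are assumptions, as the paper's exact parameterization is not given in the summary.

```python
import numpy as np

def plucker_ray_map(K, R, t, height, width):
    """Per-pixel Plücker ray map of shape (H, W, 6): direction and moment.

    K: 3x3 intrinsics; R, t: world-to-camera extrinsics such that
    x_cam = R @ x_world + t. Conventions are illustrative.
    """
    # Camera center in world coordinates.
    o = -R.T @ t
    # Pixel centers (u, v, 1) in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project to world-space ray directions and normalize:
    # d_world = R^T @ K^{-1} @ pix, written row-vector style below.
    dirs = pix @ np.linalg.inv(K).T @ R                     # (H, W, 3)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Moment m = o x d fixes the ray's position in space; (d, m) are
    # the ray's Plücker coordinates.
    moments = np.cross(np.broadcast_to(o, dirs.shape), dirs)
    return np.concatenate([dirs, moments], axis=-1)          # (H, W, 6)
```

Because both channels vary continuously with the camera pose, interpolating between two camera matrices yields a smoothly varying conditioning map, which is the property the summary credits for artifact-free camera trajectories.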
All three conditioning signals—identity latent (from a VAE encoder), pose normal maps, and ray maps—are concatenated with the diffusion noise latent in a Condition Fusion Layer. A 1×1 convolution reduces channel dimensionality before feeding the combined tensor into the DiT backbone. The DiT model, originally trained as a video foundation model, receives both the time‑step embedding and the per‑frame AdaLN parameters, ensuring temporal coherence while allowing independent manipulation of expression, pose, and view.
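Since a 1×1 convolution acts independently on each spatial location, the fusion step above reduces to a per-pixel linear projection over the concatenated channels. The sketch below shows this shape bookkeeping in NumPy; all channel sizes and the single-layer form are illustrative assumptions.

```python
import numpy as np

def fuse_conditions(noise, identity, normals, rays, W, b):
    """Condition Fusion Layer sketch: concatenate channels, apply 1x1 conv.

    Inputs are (H, W, C_i) maps. A 1x1 convolution with weight
    W of shape (C_in, C_out) is exactly a matmul over the channel axis,
    so channel reduction costs no spatial mixing.
    """
    x = np.concatenate([noise, identity, normals, rays], axis=-1)  # (H, W, C_in)
    return x @ W + b                                               # (H, W, C_out)
```

Keeping the fusion purely channel-wise preserves the pixel alignment of the normal maps and ray maps before the DiT's attention layers mix information spatially.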
Training leverages a two‑pronged dataset strategy. Real data consists of ~12 k iPhone captures (single‑camera) and ~1.4 k multi‑view studio recordings, providing diverse identities, expressions, and poses but limited camera motion. To supply continuous view supervision, the authors synthesize a large corpus using animatable Gaussian head avatars fitted to the studio captures. Two synthetic video types are generated: ViewSweep (static expression, varying camera) and Dynamic‑Sweep (simultaneous changes in expression, pose, and camera). This synthetic data offers perfectly disentangled supervision for each control signal, while real data grounds the model in realistic lighting, hair, and background variations.
Losses combine reconstruction (L2), perceptual (VGG), view‑consistency (ray‑map alignment), and expression‑consistency (cosine similarity between generated frames and encoded expression latents). The model is first fine‑tuned on synthetic data, then adapted to real data for domain generalization.
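Of the losses listed above, the expression-consistency term is the least standard; one plausible reading, sketched here, is a mean per-frame cosine distance between the expression codes of generated and driving frames. The exact weighting against the L2, VGG, and ray-map terms is not specified in the summary, so this shows only the cosine term.

```python
import numpy as np

def cosine_expression_loss(gen_latents, drive_latents, eps=1e-8):
    """Expression-consistency term: mean of (1 - cosine similarity).

    gen_latents, drive_latents: (frames, D) expression codes, e.g. the
    128-dim latents from the expression encoder. A value of 0 means the
    generated frames encode the same expression as the driving frames.
    """
    num = (gen_latents * drive_latents).sum(axis=-1)
    den = (np.linalg.norm(gen_latents, axis=-1)
           * np.linalg.norm(drive_latents, axis=-1) + eps)
    return float((1.0 - num / den).mean())
```

Using cosine distance rather than L2 on the latents makes the term insensitive to the latents' magnitude, penalizing only directional (i.e. semantic) drift in expression space.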
Extensive experiments on multiple benchmark datasets demonstrate that FactorPortrait outperforms state‑of‑the‑art GAN‑based and diffusion‑based portrait animation methods across quantitative metrics (FID, LPIPS, CSIM) and user studies. Notably, it achieves superior view consistency when rendering novel viewpoints and faithfully reproduces subtle facial dynamics that prior 2D landmark or 3DMM‑based approaches miss. Ablation studies confirm the necessity of the expression controller (AdaLN) and Plücker ray maps; removing either leads to degraded realism or view jitter.
Limitations include reliance on synthetic data for full control coverage, which may affect generalization to extreme lighting or hair styles, and the current focus on head‑and‑upper‑body regions rather than full‑body animation. Future work could incorporate lighting conditioning, extend to full‑body avatars, and explore lightweight inference for real‑time applications.
In summary, FactorPortrait presents a unified, disentangled control scheme for portrait video synthesis, combining expression latent injection, normal‑map pose conditioning, and ray‑map viewpoint encoding within a video diffusion transformer. The method delivers high‑fidelity, temporally coherent, and fully controllable portrait animations, opening new possibilities for virtual avatars, AR/VR experiences, and interactive media.