SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the Wild

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos and the model learns to synthesize lip movements from corrupted inputs and target audio. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address these mask-induced artifacts. Specifically, building on the Stage 1 model, we construct a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We then tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.


💡 Research Summary

SyncAnyone addresses the long‑standing trade‑off in audio‑driven lip‑sync: mask‑based training yields accurate lip movements but harms background consistency and identity preservation, while mask‑free approaches require costly paired data. The authors propose a two‑stage Progressive Self‑Correction (PSC) framework that first learns a diffusion‑based video transformer (DiT) to inpaint masked mouth regions, then uses this model to generate synthetic paired data for mask‑free fine‑tuning.

In Stage 1, a multi‑reference DiT is trained under the Flow Matching paradigm. Input consists of a spatial mask that zeros out the mouth area, a sequence of reference frames that remain unmasked, and an audio embedding (e.g., wav2vec2). The model predicts a deterministic vector field that transports noisy latents to the target video, effectively reconstructing the masked region while preserving the surrounding context. Multi‑reference temporal modeling supplies long‑range identity cues, enabling robust synthesis under large head poses and fast motion.
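The Stage 1 objective can be sketched as a conditional Flow Matching step under a linear-interpolation path. This is a minimal illustration, not the paper's implementation: `model` is a stand-in callable for the multi-reference DiT, and the argument names and shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_step(x1, mouth_mask, ref_frames, audio_emb, model, rng):
    """One conditional Flow Matching training step (sketch).

    x1:         clean target video latents, shape (T, D)
    mouth_mask: binary mask, 1 where the mouth region is hidden
    ref_frames: unmasked reference latents supplying identity cues
    audio_emb:  audio features (e.g., from wav2vec2)
    model:      stand-in for the DiT; predicts a velocity field
    """
    x0 = rng.standard_normal(x1.shape)           # noise endpoint of the path
    t = rng.uniform()                            # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                 # linear probability path
    target_v = x1 - x0                           # velocity transporting x0 -> x1
    masked_video = x1 * (1.0 - mouth_mask)       # mouth region zeroed out
    pred_v = model(xt, t, masked_video, ref_frames, audio_emb)
    return float(np.mean((pred_v - target_v) ** 2))  # regress the vector field

# Toy usage with a zero-predicting placeholder model.
T, D = 8, 16
x1 = rng.standard_normal((T, D))
mask = np.zeros((T, D)); mask[:, :4] = 1.0       # pretend the first dims are "mouth"
zero_model = lambda xt, t, mv, ref, a: np.zeros_like(xt)
loss = flow_matching_step(x1, mask, x1.copy(), None, zero_model, rng)
```

At inference, the learned velocity field is integrated from noise to a clean latent, so the model reconstructs the masked mouth while the unmasked context constrains everything else.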

Stage 2 exploits the Stage 1 model to create pseudo‑paired samples on‑the‑fly. For each source video, a random audio clip is sampled, the Stage 1 model generates a lip‑synced clip, and the background of the original video is composited back onto the generated frames. This “background fusion” eliminates mask‑induced artifacts while keeping the lip region edited. The resulting synthetic pairs (original video, edited video) are used to train a mask‑free DiT with the same architecture but without any mask channel. The loss combines pixel‑wise reconstruction on the mouth area with perceptual and temporal consistency terms, ensuring that the final model edits only the lips while preserving the exact background and identity.
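The background-fusion step described above reduces to a mask composite: keep the generated pixels inside the (soft) mouth mask and restore the original pixels everywhere else. The sketch below is an assumption-level illustration; the function name, shapes, and the hard toy mask are not from the paper.

```python
import numpy as np

def background_fusion(source, generated, mouth_mask):
    """Paste the original background over generated frames (sketch).

    source:     original frames, shape (T, H, W, C), floats in [0, 1]
    generated:  Stage 1 lip-synced frames, same shape
    mouth_mask: soft mask in [0, 1], shape (T, H, W, 1); 1 keeps the
                generated (edited) lips, 0 restores the source background
    """
    return mouth_mask * generated + (1.0 - mouth_mask) * source

# Build one pseudo-paired sample: (source video, edited video).
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 3
source = rng.uniform(size=(T, H, W, C))
generated = rng.uniform(size=(T, H, W, C))
mask = np.zeros((T, H, W, 1)); mask[:, 4:, 2:6, :] = 1.0   # toy mouth box
edited = background_fusion(source, generated, mask)
pseudo_pair = (source, edited)
```

In practice a feathered (blurred) mask would avoid visible seams at the composite boundary; the resulting `(source, edited)` pairs supervise the mask-free Stage 2 model.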

Extensive experiments on in‑the‑wild datasets demonstrate that SyncAnyone outperforms state‑of‑the‑art GAN‑based methods (Wav2Lip, StyleSync) and recent diffusion‑based approaches (LatentSync, OmniSync) across several metrics: stronger lip‑sync scores (tighter audio‑lip alignment), lower FID and LPIPS (higher visual fidelity), and reduced T‑LPIPS (greater temporal coherence). Qualitative results show resilience to extreme poses, occlusions, scene cuts, and diverse visual styles. Ablation studies confirm that (1) the multi‑reference conditioning improves motion realism, (2) the synthetic paired data are crucial for background consistency, and (3) the background fusion module significantly reduces edge artifacts.
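The temporal-coherence metric mentioned above averages a perceptual distance over consecutive frame pairs, so flicker inflates the score. The sketch below substitutes a plain L2 distance for the LPIPS network (which in practice compares pretrained AlexNet/VGG features); the function name and signature are illustrative.

```python
import numpy as np

def temporal_lpips(frames, dist):
    """Mean distance between consecutive frames (lower = more stable).

    frames: array of frames, shape (T, H, W, C)
    dist:   pairwise frame distance; LPIPS in practice, L2 stand-in here
    """
    pairs = zip(frames[:-1], frames[1:])
    return float(np.mean([dist(a, b) for a, b in pairs]))

l2 = lambda a, b: np.mean((a - b) ** 2)
static = np.ones((5, 4, 4, 3))        # perfectly static clip: no flicker
score = temporal_lpips(static, l2)    # -> 0.0 for identical frames
```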

The paper’s contributions are threefold: (i) identification of the mask‑induced trade‑off and a novel PSC pipeline that bridges mask‑based and mask‑free learning, (ii) an efficient online pseudo‑pair generation scheme that preserves identity and background without manual data collection, and (iii) a unified diffusion‑transformer framework that achieves state‑of‑the‑art performance on in‑the‑wild lip‑sync tasks. The authors also discuss limitations such as computational cost of the first stage and potential bias from synthetic audio‑video pairs, and suggest future work on real‑time inference, multilingual audio conditioning, and integration with 3D facial priors.

