MakeItTalk: Speaker-Aware Talking-Head Animation
We present a method that generates expressive talking heads from a single facial image with audio as the only input. In contrast to previous approaches that attempt to learn direct mappings from audio to raw pixels or points for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting speaker-aware dynamics. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with a full range of motion, and can also animate artistic paintings, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures in a single unified framework. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking heads of significantly higher quality compared to the prior state of the art.
💡 Research Summary
MakeItTalk introduces a unified framework for generating expressive talking‑head videos from a single portrait image and an audio clip, without requiring any subject‑specific training data. The key novelty lies in explicitly disentangling the speech signal into two latent spaces: a content embedding that captures phonetic and prosodic information, and a speaker embedding that encodes the identity‑dependent dynamics such as head pose, eye movement, and subtle facial expressions. This separation is achieved by leveraging a voice‑conversion network (e.g., Auto‑VC) that has been shown to isolate content from speaker style in the audio domain.
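The disentanglement described above can be sketched as two encoders applied to the same audio features: one producing a per-frame content embedding and one producing a single time-averaged speaker embedding. The sketch below is a minimal, hypothetical stand-in (random linear layers instead of the trained Auto-VC-style networks; the dimensions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): 80-dim mel frames,
# 64-dim per-frame content code, 32-dim per-clip speaker code.
N_MEL, D_CONTENT, D_SPEAKER = 80, 64, 32

# Randomly initialized stand-ins for the trained encoders.
Wc = rng.standard_normal((N_MEL, D_CONTENT)) * 0.01
Ws = rng.standard_normal((N_MEL, D_SPEAKER)) * 0.01

def disentangle(mel):
    """Split a mel-spectrogram (T, 80) into content and speaker codes."""
    content = np.tanh(mel @ Wc)           # (T, 64): per-frame phonetic content
    speaker = np.tanh(mel @ Ws).mean(0)   # (32,): time-averaged identity code
    return content, speaker

mel = rng.standard_normal((100, N_MEL))   # 100 frames of a fake mel-spectrogram
content, speaker = disentangle(mel)
print(content.shape, speaker.shape)       # (100, 64) (32,)
```

The key structural point is that the speaker code is pooled over time, so it can only carry frame-rate-invariant identity information, while the content code retains the per-frame temporal resolution needed to drive lip motion.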
The content embedding drives a “speech‑content animation” module that predicts the motion of the lips, jaw, and nearby facial regions. Meanwhile, the speaker embedding modulates a “speaker‑aware animation” branch that adds identity‑specific dynamics to the same set of facial landmarks. Both branches feed into a temporal model composed of an LSTM and a self‑attention mechanism, allowing the system to capture short‑term phoneme‑level changes as well as long‑range dependencies such as head turns and expressive gestures. The output of this network is a sequence of 68 2‑D facial landmarks for each audio frame. By operating on landmarks rather than raw pixels, the method reduces the dimensionality of the output space from millions of pixels to a few dozen degrees of freedom, which dramatically improves data efficiency and enables training on moderately sized datasets.
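To make the landmark-prediction step concrete, here is a minimal numpy sketch of a single-head self-attention layer mapping a sequence of content embeddings to per-frame offsets over a static 68-point face template. It is a hypothetical simplification: the paper's model also includes an LSTM and the speaker-aware branch, and all weights below are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 100, 64          # frames and content-embedding dim (illustrative sizes)
N_LM = 68               # the method predicts 68 2-D facial landmarks per frame

def self_attention(x):
    """Single-head scaled dot-product self-attention over the time axis."""
    scores = x @ x.T / np.sqrt(x.shape[1])        # (T, T) frame-to-frame affinity
    w = np.exp(scores - scores.max(1, keepdims=True))
    w /= w.sum(1, keepdims=True)                  # softmax over frames
    return w @ x                                  # each frame mixes in context

content = rng.standard_normal((T, D))
W_out = rng.standard_normal((D, N_LM * 2)) * 0.01  # projection to landmark offsets

template = rng.standard_normal((N_LM, 2))          # static landmarks of the input face
offsets = (self_attention(content) @ W_out).reshape(T, N_LM, 2)
landmarks = template[None] + offsets               # (T, 68, 2) animated landmarks
print(landmarks.shape)
```

Predicting offsets from a fixed template mirrors the dimensionality argument in the paragraph above: the network outputs 68 × 2 = 136 numbers per frame instead of a full image.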
To render the final video, the authors propose two landmark‑to‑image pipelines. For non‑photorealistic inputs (paintings, sketches, 2‑D cartoons, Japanese manga, caricatures), a simple Delaunay‑triangulation based warping technique is applied, deforming the original artwork according to the predicted landmark motion. For realistic human faces, an image‑to‑image translation network (a UNet‑style GAN) converts the landmark map into a full‑resolution RGB frame while preserving the original texture, lighting, and high‑frequency details of the source photograph. This dual‑path approach allows the same underlying landmark predictor to drive both photorealistic and stylized animations.
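The triangulation-based warp for stylized inputs reduces to computing one affine transform per Delaunay triangle, mapping its source vertices to their predicted positions. A minimal sketch (synthetic landmarks; a real renderer would then warp each triangle's pixels with its transform):

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(2)
src = rng.random((68, 2))                          # landmarks on the source artwork
dst = src + 0.01 * rng.standard_normal((68, 2))    # predicted (moved) landmarks

tri = Delaunay(src)                                # triangulate once on the source layout

def triangle_affine(s, d):
    """2x3 affine map A such that A @ [x, y, 1] sends triangle s onto d."""
    S = np.hstack([s, np.ones((3, 1))])            # 3x3 homogeneous source vertices
    return np.linalg.solve(S, d).T                 # solve S @ X = d, return X.T

# One affine transform per triangle of the mesh.
affines = [triangle_affine(src[t], dst[t]) for t in tri.simplices]
print(len(affines), affines[0].shape)
```

Because the triangulation is computed once on the source landmarks and only the vertex positions change per frame, the artwork's texture moves rigidly within each triangle, which is what keeps line work and shading intact in the stylized outputs.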
Extensive quantitative experiments compare MakeItTalk against state‑of‑the‑art lip‑sync models, 3D morphable‑model based methods, and recent GAN‑based talking‑head generators. Metrics such as Landmark Distance (LMD), PSNR, SSIM, as well as newly introduced measures of head‑pose variance and expression diversity, all show substantial improvements. User studies further confirm that participants perceive the generated videos as more natural, expressive, and speaker‑specific. Notably, the system generalizes to unseen faces and voices at test time, producing plausible animations for characters that were never present in the training set—a capability that many prior methods lack.
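Of the metrics above, Landmark Distance (LMD) is the most direct to illustrate: the Euclidean distance between predicted and ground-truth landmarks, averaged over points and frames (exact normalization varies between papers; this is one common form):

```python
import numpy as np

def landmark_distance(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth landmarks.

    pred, gt: arrays of shape (frames, n_landmarks, 2).
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy check: every predicted point is offset by (3, 4) -> distance 5.
pred = np.zeros((10, 68, 2))
gt = np.full((10, 68, 2), [3.0, 4.0])
print(landmark_distance(pred, gt))  # 5.0
```

Lower LMD means tighter agreement with the reference motion; pixel-level metrics such as PSNR and SSIM are then applied to the rendered frames rather than the landmarks.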
The paper also contributes a set of evaluation protocols for speaker‑aware animation, including objective statistics on head motion, action‑unit activation, and subjective questionnaires that assess identity preservation and overall realism. By demonstrating that content‑style disentanglement, a well‑designed temporal model, and a landmark‑centric pipeline can jointly produce high‑quality, style‑agnostic talking heads, MakeItTalk sets a new benchmark for audio‑driven facial animation. Its implications span virtual avatars, live streaming, gaming, and mixed‑reality applications where rapid, automatic generation of expressive characters from minimal input is essential.