SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.


💡 Research Summary

The paper introduces SLD‑L2S, a novel lip‑to‑speech (L2S) system that bypasses traditional intermediate representations such as mel‑spectrograms or discrete self‑supervised learning (SSL) tokens. Instead, it directly maps visual lip movements to the continuous latent space of a pre‑trained neural audio codec (e.g., HiFi‑Codec). This direct mapping preserves fine‑grained acoustic details that are usually lost when using intermediate formats.

The architecture consists of four main components. First, a visual frontend based on AV‑HuBERT Large extracts phoneme‑level features from the input video. Second, a subspace decomposition module temporally upsamples these features to match the codec's latent resolution and then splits them into several parallel 1‑D convolutional subspaces, each learning a distinct representation. Third, the core diffusion model is built from Diffusion Convolution Blocks (DiCB). Unlike transformer‑based diffusion models, DiCB replaces self‑attention with depthwise convolutional attention (whose kernel spans the time and subspace dimensions) and a convolutional feed‑forward network, enabling efficient modeling of local and cross‑subspace dependencies. Conditioning on the diffusion time step and speaker embedding is performed via an adaptive layer norm (AdaLN‑SOLA), which reduces parameter overhead while stabilizing training. Fourth, a subspace recomposition module concatenates the processed subspaces into a single latent vector that is fed to the decoder of the pre‑trained audio codec to synthesize the final waveform.
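To make the decomposition/recomposition step concrete, here is a minimal NumPy sketch. The function names, the nearest‑neighbour upsampling, and the even channel split are our illustrative assumptions, not the paper's actual implementation: the idea is only that features are brought to the codec's latent frame rate and then partitioned into parallel subspaces that recomposition later concatenates back together.

```python
import numpy as np

def subspace_decompose(visual_feats, upsample_factor, num_subspaces):
    """Illustrative sketch (names are ours): temporally upsample visual
    features to the codec's latent frame rate, then split the channel
    axis into parallel subspaces."""
    T, D = visual_feats.shape
    assert D % num_subspaces == 0, "channels must divide evenly across subspaces"
    # Nearest-neighbour temporal upsampling to match the codec latent resolution.
    upsampled = np.repeat(visual_feats, upsample_factor, axis=0)   # (T*f, D)
    # Channel split into parallel subspaces: (num_subspaces, T*f, D // num_subspaces)
    return np.stack(np.split(upsampled, num_subspaces, axis=1))

def subspace_recompose(subspaces):
    """Inverse step: concatenate the processed subspaces back into one
    latent stream for the codec decoder."""
    return np.concatenate(list(subspaces), axis=1)                 # (T*f, D)

feats = np.random.randn(50, 1024)   # 50 video frames, 1024-dim frontend features
subs = subspace_decompose(feats, upsample_factor=2, num_subspaces=4)
print(subs.shape)                   # (4, 100, 256)
print(subspace_recompose(subs).shape)  # (100, 1024)
```

In the real model, each subspace would pass through its own 1‑D convolutional pathway (and the DiCB stack) between these two steps; the sketch only shows the shape bookkeeping.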

Training uses a re‑parameterized flow‑matching objective. For a linear ODE path xₜ = (1‑t)x₀ + t x₁, the network learns the velocity field vθ(xₜ, C, t) that should equal x₁‑x₀, where C denotes the visual subspace tensor. This eliminates the need for iterative denoising steps and allows direct generation of continuous codec latents. To improve intelligibility and semantic fidelity, two auxiliary losses are added: (1) a semantic consistency loss that aligns generated latents with ground‑truth latents in a feature space (e.g., L2 or cosine similarity), and (2) a speech language model (SLM) loss that evaluates the decoded waveform with a pre‑trained speech LM (such as Whisper‑LM) and penalizes low log‑probability. The total loss is a weighted sum of flow‑matching, semantic, and SLM terms.
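The flow‑matching objective described above can be sketched in a few lines of NumPy. This is a generic linear‑path flow‑matching loss under our own simplifying assumptions (Gaussian prior for x₀, uniform time sampling, a placeholder model), not the paper's re‑parameterized variant or its exact loss weighting:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, cond):
    """Linear-path flow matching: sample t ~ U(0,1), form
    x_t = (1 - t) * x0 + t * x1, and regress the predicted velocity
    v_theta(x_t, cond, t) onto the constant target x1 - x0."""
    x0 = rng.standard_normal(x1.shape)        # sample from the Gaussian prior
    t = rng.uniform(size=(x1.shape[0], 1))    # one time step per batch element
    xt = (1.0 - t) * x0 + t * x1              # point on the linear ODE path
    target_v = x1 - x0                        # ground-truth velocity field
    pred_v = model(xt, cond, t)
    return np.mean((pred_v - target_v) ** 2)  # mean-squared flow-matching loss

# Placeholder "model" that always predicts zero velocity, just to run the loss.
zero_model = lambda xt, cond, t: np.zeros_like(xt)
x1 = rng.standard_normal((8, 128))            # batch of target codec latents
print(flow_matching_loss(zero_model, x1, cond=None))
```

In the full system, this term would be combined with the semantic‑consistency and SLM losses as a weighted sum, with the weights as tunable hyperparameters.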

Experiments were conducted on multiple benchmark datasets, including LRS3‑Lip, GRID, and VoxCeleb2‑Lip, covering multi‑speaker and diverse speaking styles. Objective metrics (PESQ, STOI, WER, SI‑SDR) show substantial improvements over strong baselines: PESQ rises from ~3.8 to 4.3, STOI from 0.88 to 0.94, and WER drops from 12 % to 7 %. Subjective mean opinion scores (MOS) reach 4.3/5, outperforming prior diffusion‑based L2S systems (≈3.6). Ablation studies confirm that each component—subspace decomposition, DiCB, flow‑matching re‑parameterization, and the auxiliary losses—contributes positively to overall performance.

The authors argue that the main advantages of SLD‑L2S are: (1) elimination of lossy intermediate representations, (2) hierarchical subspace processing that captures the multiple facets of the visual representation more robustly, (3) a convolution‑centric diffusion backbone that is computationally efficient and better suited to the temporal nature of speech, and (4) a flexible loss formulation that jointly optimizes acoustic fidelity and linguistic correctness. Limitations include reliance on large pre‑trained visual and audio models, and the need for further model compression to achieve real‑time inference.

In conclusion, SLD‑L2S sets a new state‑of‑the‑art for high‑fidelity lip‑to‑speech synthesis by directly generating neural codec latents via a hierarchical subspace diffusion framework, achieving superior objective and subjective results across diverse datasets. Future work may explore lightweight variants, stronger speaker control, and integration with large‑scale unlabelled video‑audio corpora to further push the boundaries of visual‑only speech generation.

