JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built on the powerful Diffusion Transformer (DiT) architecture, JavisDiT simultaneously generates high-quality audio and video content from open-ended user prompts in a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, which consists of 10,140 high-quality text-captioned sounding videos and focuses on synchronization evaluation in diverse and complex real-world scenarios. We further devise a robust metric for measuring the synchrony between generated audio-video pairs in real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and data are available at https://javisverse.github.io/JavisDiT-page/.
💡 Research Summary
JavisDiT introduces a unified diffusion‑transformer framework for Joint Audio‑Video Generation (JAVG), tackling the twin challenges of high‑fidelity multimodal synthesis and fine‑grained spatio‑temporal synchronization. Built on the powerful Diffusion Transformer (DiT) backbone, the model employs shared AV‑DiT blocks for video and audio streams, each equipped with Spatio‑Temporal Self‑Attention (ST‑SelfAttn) for intra‑modal context aggregation and a coarse‑grained cross‑attention layer that injects global semantic information from a T5 text encoder.
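The factorized intra-modal attention described above can be sketched as follows. This is a minimal NumPy illustration (single-head, untied projections, hypothetical shapes), not the authors' implementation: tokens of shape (T, S, d) first attend within each frame (spatial), then across frames at each spatial position (temporal).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(x):
    """Single-head self-attention over (n, d) tokens; Q=K=V=x for brevity."""
    d = x.shape[-1]
    a = softmax(x @ x.T / np.sqrt(d))
    return a @ x

def st_self_attn(tokens):
    """Factorized spatio-temporal self-attention sketch.
    tokens: (T, S, d) — T frames, S spatial positions per frame."""
    T, S, _ = tokens.shape
    # Spatial pass: attention among the S tokens within each frame.
    out = np.stack([self_attn(tokens[t]) for t in range(T)])
    # Temporal pass: attention across the T frames at each spatial position.
    out = np.stack([self_attn(out[:, s]) for s in range(S)], axis=1)
    return out
```

The same factorization applies to the audio stream, with the spectrogram's frequency bins playing the role of spatial positions.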
The core novelty lies in the Hierarchical Spatial‑Temporal Synchronized Prior (HiST‑Sypo) Estimator. This module extracts two complementary priors from the input prompt: (1) a global semantic prior (simply the T5 embedding) that conveys “what” is happening, and (2) a fine‑grained spatio‑temporal prior that specifies “where” and “when” events occur. The fine‑grained prior is generated by a four‑layer transformer encoder‑decoder that queries 32 spatial tokens and 32 temporal tokens from the 77 hidden states of ImageBind’s text encoder. Rather than producing deterministic tokens, the estimator predicts a Gaussian distribution (mean and variance) for each token, allowing stochastic sampling that captures the inherent variability of event locations and timings.
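The stochastic token prediction can be sketched as below. This is a toy stand-in for the HiST-Sypo estimator (all weights, dimensions, and names are hypothetical): learnable queries cross-attend to the text encoder's hidden states, and Gaussian heads predict a per-token mean and log-variance, from which prior tokens are drawn by reparameterized sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StochasticPriorEstimator:
    """Toy sketch of a stochastic prior estimator: queries attend to text
    hidden states, then mu/logvar heads enable reparameterized sampling."""
    def __init__(self, n_queries=64, d=32):  # e.g. 32 spatial + 32 temporal
        self.q = rng.normal(size=(n_queries, d)) * 0.1        # learnable queries
        self.w_mu = rng.normal(size=(d, d)) * 0.1             # mean head
        self.w_logvar = rng.normal(size=(d, d)) * 0.1         # log-variance head

    def __call__(self, text_hidden):
        """text_hidden: (77, d) hidden states from a text encoder."""
        d = self.q.shape[-1]
        attn = softmax(self.q @ text_hidden.T / np.sqrt(d))   # (n_queries, 77)
        ctx = attn @ text_hidden                              # (n_queries, d)
        mu, logvar = ctx @ self.w_mu, ctx @ self.w_logvar
        eps = rng.normal(size=mu.shape)                       # reparameterization
        return mu + np.exp(0.5 * logvar) * eps                # sampled prior tokens
```

Because sampling draws fresh noise each call, the same prompt yields different but distributionally consistent prior tokens, mirroring the variability of event locations and timings.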
These priors are injected into the diffusion process via Fine‑Grained ST‑CrossAttention (ST‑CrossAttn), which aligns video and audio tokens along both spatial (H×W for video, M for audio spectrogram) and temporal (T_v, T_a) axes. In addition, a Multi‑Modality Bidirectional Cross‑Attention (MM‑BiCrossAttn) computes an attention matrix between video queries and audio keys, then uses it to generate both audio‑to‑video and video‑to‑audio cross‑attention maps, fostering rich mutual information exchange.
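The bidirectional exchange can be illustrated with a minimal sketch (hypothetical, single-head, no learned projections): a single score matrix between video queries and audio keys is normalized along each axis to produce both the video-attends-to-audio and audio-attends-to-video maps.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_cross_attn(v_tokens, a_tokens):
    """Toy bidirectional cross-attention sketch.
    v_tokens: (Nv, d) video tokens; a_tokens: (Na, d) audio tokens.
    One score matrix yields both directions of cross-modal attention."""
    d = v_tokens.shape[-1]
    scores = v_tokens @ a_tokens.T / np.sqrt(d)     # shared (Nv, Na) scores
    v_out = softmax(scores, axis=-1) @ a_tokens     # video attends to audio
    a_out = softmax(scores.T, axis=-1) @ v_tokens   # audio attends to video
    return v_out, a_out
```

Sharing one score matrix, rather than computing two independent cross-attentions, keeps the two directions consistent with each other at negligible extra cost.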
Training leverages a contrastive learning objective for the prior estimator: multiple sampled priors from the same text are encouraged to stay close to each other while remaining distinct from priors derived from different texts. This strategy yields a robust estimator that generalizes to unseen, complex scenes.
To evaluate the system, the authors construct JavisBench, a new benchmark of 10,140 high‑quality sounding videos with textual captions. The dataset spans five dimensions and 19 scene categories, with more than half containing intricate, multi‑event scenarios that reflect real‑world usage. Recognizing that existing metrics (e.g., FAD, VMAF) inadequately capture synchronization, they propose JavisScore, a temporal‑aware semantic alignment metric that jointly measures the correspondence of event onset/offset times and semantic similarity between audio and visual streams.
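The intuition behind a temporal-aware alignment metric can be sketched as follows. This is a hypothetical simplification, not the actual JavisScore computation (which relies on learned audio-visual embeddings): per-timestep audio and video features are compared by cosine similarity over sliding temporal windows, so that mistimed events lower the score even when global semantics match.

```python
import numpy as np

def windowed_av_alignment(v_emb, a_emb, window=4):
    """Sketch of a temporal-aware AV alignment score.
    v_emb, a_emb: (T, d) per-timestep video/audio features (assumed aligned
    to a common time axis). Returns mean cosine similarity over windows."""
    assert v_emb.shape == a_emb.shape
    T = v_emb.shape[0]
    sims = []
    for t in range(T - window + 1):
        v = v_emb[t:t + window].mean(axis=0)                 # pooled video window
        a = a_emb[t:t + window].mean(axis=0)                 # pooled audio window
        sims.append(v @ a / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8))
    return float(np.mean(sims))
```

Window-level pooling is what distinguishes such a metric from a clip-level similarity: shifting an audio event by a few windows changes the per-window pairs and hence the score, whereas a global average would be unaffected.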
Extensive experiments on both JavisBench and established datasets (Landscape, AIST++) demonstrate that JavisDiT outperforms prior state‑of‑the‑art methods such as AV‑DiT, MM‑LDM, Uniform, and SyncFlow across all quality metrics (PSNR, SSIM, FVD, FAD) and, crucially, achieves a 12‑18 % improvement in JavisScore on complex scenes. Ablation studies confirm the importance of each component: removing HiST‑Sypo degrades synchronization, while omitting MM‑BiCrossAttn harms overall generation fidelity.
In summary, JavisDiT advances JAVG by (1) leveraging DiT’s strong token‑level diffusion capabilities, (2) introducing a hierarchical prior mechanism that provides both global context and fine‑grained spatio‑temporal cues, (3) enabling bidirectional cross‑modal attention for richer fusion, and (4) offering a realistic benchmark and a dedicated synchronization metric. The work opens avenues for future research on large‑scale multimodal prior pre‑training, real‑time streaming generation, and extending the framework to longer, narrative‑style audio‑visual content.