FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space
We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. Our framework is built on three key components: (1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatio-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer-memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy using a few-step DIT upsampler to increase video fidelity. Our final model, comprising a 14B DIT base model and a 14B DIT upsampler, performs competitively against other popular open-source models while being an order of magnitude faster. We discuss our model design and training strategies in this report.
💡 Research Summary
FSVideo presents a novel image‑to‑video (I2V) diffusion framework that dramatically reduces inference time while maintaining competitive visual quality. The system is built around three core components: (1) a highly compressed video autoencoder (FSAE) that downsamples spatial dimensions by a factor of 64 (64 × 64) and temporally by a factor of 4, yielding a latent tensor of shape (128 × T/4 × H/64 × W/64); (2) a 14‑billion‑parameter diffusion transformer (DIT) serving as the base generative model, enhanced with a “layer‑memory” mechanism that stores intermediate representations for reuse across layers, thereby improving information flow without appreciable overhead; and (3) a multi‑resolution upsampler consisting of a lightweight CNN latent upscaler followed by a second 14B DIT refiner that runs for only a few diffusion steps (typically 4–8), using step distillation to keep the extra computation minimal.
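To make the compression arithmetic above concrete, the sketch below computes the FSAE latent shape for a given clip. The channel count (128) and downsampling factors (64× spatial, 4× temporal) come from the summary; the handling of frame counts not divisible by 4 (e.g. causal padding of the first frame) is simplified here by assuming divisibility.

```python
# Sketch of the FSAE latent-shape arithmetic (values from the summary;
# causal-frame handling is simplified by requiring T divisible by 4).

def fsae_latent_shape(t_frames: int, height: int, width: int,
                      c_latent: int = 128, s_down: int = 64, t_down: int = 4):
    """Return (C, T', H', W') of the compressed latent tensor."""
    assert height % s_down == 0 and width % s_down == 0
    assert t_frames % t_down == 0
    return (c_latent, t_frames // t_down, height // s_down, width // s_down)

# A 120-frame 1024 x 1024 clip compresses to a (128, 30, 16, 16) latent.
print(fsae_latent_shape(120, 1024, 1024))
```

Even at 1024² resolution, the spatial grid of the latent is only 16 × 16, which is what makes full-attention DIT training tractable at this scale.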
The autoencoder is derived from the DC‑AE architecture, extended with causal 3‑D convolutions and additional transformer blocks to achieve the 64 × 64 × 4 compression. Training proceeds in three stages, progressively increasing spatial resolution (256² → 512² → 1024²) and video length (17 → 61 → 121 frames). Losses combine L1, LPIPS, and a GAN objective with a 3‑D multi‑scale discriminator. To align the latent space with semantic video features, the authors introduce a Video‑VF loss, which maps the latent tensor to DINOv2 frame embeddings and then applies marginal cosine‑similarity and distance‑matrix‑similarity terms. Intrinsic‑dimension analysis (using GRIDE) shows that this regularization yields the lowest latent complexity among the tested variants, indicating a more compact, generation‑friendly representation.
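A minimal sketch of what such a regularizer could look like follows. The per-frame pooling of the latent, the projection head `proj`, the margin value, and the equal weighting of the two terms are all assumptions; the summary specifies only a marginal cosine-similarity term against DINOv2 frame embeddings and a distance-matrix-similarity term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of a Video-VF-style loss: the projection head and the
# loss weighting are illustrative assumptions, not the authors' code.

def video_vf_loss(z, dino_feats, proj, margin: float = 0.0):
    """
    z:          per-frame pooled latents, shape (B, T, D_lat)
    dino_feats: DINOv2 frame embeddings,  shape (B, T, D_dino)
    proj:       learned map D_lat -> D_dino
    """
    p = proj(z)                                      # (B, T, D_dino)
    # Marginal cosine similarity: push each projected frame latent
    # toward its DINOv2 embedding, up to a margin.
    cos = F.cosine_similarity(p, dino_feats, dim=-1)  # (B, T)
    l_cos = F.relu(1.0 - margin - cos).mean()
    # Distance-matrix similarity: the pairwise frame-to-frame similarity
    # structure of the latents should mirror that of the DINOv2 features.
    pn = F.normalize(p, dim=-1)
    dn = F.normalize(dino_feats, dim=-1)
    sim_p = pn @ pn.transpose(-1, -2)                 # (B, T, T)
    sim_d = dn @ dn.transpose(-1, -2)
    l_mat = (sim_p - sim_d).abs().mean()
    return l_cos + l_mat

proj = nn.Linear(32, 64)
loss = video_vf_loss(torch.randn(2, 8, 32), torch.randn(2, 8, 64), proj)
```

Both terms are non-negative, so a loss near zero indicates the latent frames are both pointwise aligned with and structurally similar to the semantic features.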
The DIT’s layer‑memory design adds a small linear projection per layer that writes a “memory slot” containing a summary of the previous layer’s output. Subsequent layers can attend to this slot, enabling cross‑layer context reuse and better utilization of the model’s capacity. This mechanism is especially beneficial for the 14‑B model, where naïve depth can lead to diminishing returns.
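The write/read pattern above can be sketched as follows. This is a simplified stand-in, not the authors' implementation: the block body is a placeholder for attention + MLP, the memory slot is a single pooled vector, and the "attend to this slot" step is reduced to an additive injection.

```python
import torch
import torch.nn as nn

# Minimal sketch of the layer-memory idea: each block writes a linear
# summary of its output into a memory slot, and the next block reads
# that slot before its own computation. Names and sizes are assumptions.

class LayerMemoryBlock(nn.Module):
    def __init__(self, dim: int, mem_dim: int):
        super().__init__()
        self.body = nn.Sequential(               # stand-in for attention + MLP
            nn.LayerNorm(dim), nn.Linear(dim, dim))
        self.write = nn.Linear(dim, mem_dim)     # summarize output -> memory
        self.read = nn.Linear(mem_dim, dim)      # inject previous layer's memory

    def forward(self, x, mem):
        if mem is not None:
            # Simplified: additive injection instead of cross-attention.
            x = x + self.read(mem).unsqueeze(1)  # broadcast over tokens
        x = x + self.body(x)
        mem = self.write(x.mean(dim=1))          # pooled summary as new slot
        return x, mem

blocks = nn.ModuleList(LayerMemoryBlock(64, 16) for _ in range(4))
x, mem = torch.randn(2, 10, 64), None
for blk in blocks:
    x, mem = blk(x, mem)
```

Because the projection is small relative to the block itself, the extra parameters and FLOPs per layer are marginal, which is consistent with the "without appreciable overhead" claim above.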
The upsampling stage first expands the compressed latent via a CNN (spatial up‑scaling by 2‑4×). The upsample‑refiner DIT then refines the upscaled latent in a few diffusion steps, guided by the original image’s encoder features through cross‑attention in the decoder. This multi‑resolution strategy yields high‑fidelity videos (720p and above) while adding only a fraction of the computational cost of a full‑resolution diffusion pass.
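The two-stage flow above can be sketched as follows, with the refiner abstracted to a callable. The conv + pixel-shuffle upscaler, the per-frame (B·T, C, h, w) layout, and the 4-step loop are illustrative assumptions; the summary states only that a CNN upscales the latent 2–4× and a few-step DIT refines it with image-feature conditioning.

```python
import torch
import torch.nn as nn

# Hedged sketch of the multi-resolution stage: a conv + pixel-shuffle
# stand-in for the CNN latent upscaler, and an abstract few-step refiner.

class LatentUpscaler(nn.Module):
    def __init__(self, channels: int = 128, factor: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * factor * factor,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(factor)   # channels -> 2x spatial

    def forward(self, z):                        # z: (B*T, C, h, w)
        return self.shuffle(self.conv(z))        # -> (B*T, C, 2h, 2w)

def refine_few_steps(z_up, refiner, image_feats, steps: int = 4):
    """Few-step refinement of the upscaled latent; image-feature
    conditioning (cross-attention inside the DIT) is abstracted into
    the `refiner` callable here."""
    for t in reversed(range(steps)):             # e.g. 4 distilled steps
        z_up = refiner(z_up, t, image_feats)
    return z_up

up = LatentUpscaler()
z_hi = up(torch.randn(3, 128, 16, 16))           # per-frame latents
out = refine_few_steps(z_hi, lambda z, t, f: 0.9 * z, image_feats=None)
```

Because the refiner runs only a handful of steps on an already-structured latent, its cost stays a small fraction of a full-resolution diffusion pass.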
Empirically, FSVideo (14B base + 14B upsampler) achieves a 42.3× speedup over Wan2.1‑14B‑720P, a prominent open‑source video diffusion model, while delivering comparable or slightly better scores on PSNR, SSIM, and CLIP‑based text‑video alignment metrics. The compressed latent space reduces the token count per diffusion step, lowering GPU memory usage; combined with temporal slicing (applied to the LPIPS loss as well) and 3‑D patch training, this lets the authors train on 1024² × 121‑frame videos using eight 80 GB GPUs.
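The token-count reduction driving these savings can be estimated with back-of-the-envelope arithmetic. The 8 × 8 × 4 baseline compression ratio (typical of autoencoders used by models such as Wan2.1) and the 64-divisible 768 × 1280 resolution are illustrative assumptions, not figures from the report.

```python
# Rough tokens-per-diffusion-step comparison, assuming one token per
# latent voxel. Baseline 8x8x4 ratio and 768x1280 resolution are
# illustrative assumptions for the arithmetic.

def dit_tokens(t_frames: int, height: int, width: int,
               s_down: int, t_down: int) -> int:
    return (t_frames // t_down) * (height // s_down) * (width // s_down)

baseline = dit_tokens(120, 768, 1280, s_down=8,  t_down=4)  # typical VAE
fsvideo  = dit_tokens(120, 768, 1280, s_down=64, t_down=4)  # FSAE
print(baseline, fsvideo, baseline // fsvideo)  # 64x fewer tokens
```

Since self-attention cost grows roughly quadratically with sequence length, a 64× token reduction translates into far more than a 64× reduction in attention FLOPs per step, which is how latent compression compounds with few-step upsampling into an order-of-magnitude end-to-end speedup.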
Limitations include the focus on I2V; extending to pure text‑to‑video would require additional conditioning mechanisms. The fixed 4× temporal compression may hinder very long sequences (>200 frames) where motion continuity is critical. Future work could explore adaptive compression ratios, stronger flow‑based regularization, and integration of large language models for richer textual control.
In summary, FSVideo demonstrates that a combination of extreme latent compression, layer‑memory‑augmented diffusion transformers, and a lightweight multi‑resolution upsampler can shift the speed‑quality trade‑off of video generation by an order of magnitude, opening the door to real‑time or low‑cost cloud video synthesis applications.