Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which introduces an architectural gap: full attention is replaced by causal attention. Existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher’s flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and code: https://thu-ml.github.io/CausalForcing.github.io/
💡 Research Summary
The paper tackles a fundamental problem in real‑time interactive video generation: how to distill a powerful pretrained bidirectional diffusion model into a few‑step autoregressive (AR) student that can generate frames sequentially with low latency. Existing state‑of‑the‑art methods such as Self‑Forcing follow a two‑stage pipeline: (1) ODE distillation to initialize the AR student, and (2) a Distribution‑Matching Distillation (DMD) fine‑tuning stage. While this reduces the sampling‑step gap, it overlooks a deeper “architectural gap” caused by replacing a full‑attention bidirectional teacher (which sees future frames) with a causal‑attention AR student (which only sees past frames).
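The architectural gap can be made concrete with attention masks. The following toy sketch (not the paper's code; frame-level masks are simplified to one token per frame) contrasts the full mask a bidirectional teacher uses with the causal mask an AR student is restricted to:

```python
import numpy as np

# Toy illustration of the architectural gap, assuming one token per frame.
# A bidirectional teacher attends with a full mask (every frame sees every
# frame, including future ones); a causal AR student only sees the past.
T = 4  # number of frames

full_mask = np.ones((T, T), dtype=bool)             # teacher: all pairs visible
causal_mask = np.tril(np.ones((T, T), dtype=bool))  # student: frame t sees frames <= t

# Entries above the diagonal are exactly the future-frame context the
# student loses when full attention is replaced by causal attention.
lost = full_mask.sum() - causal_mask.sum()
print(lost)  # → 6
```

Distillation pipelines that only match sampling steps leave this loss of future context unaddressed, which is the gap the paper targets.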
The authors identify that ODE distillation requires injectivity of the paired data: each noisy sample must correspond to a unique clean sample. In the bidirectional‑to‑bidirectional setting, injectivity holds at the video level because the probability‑flow ODE (PF‑ODE) of diffusion is a bijection. However, for an AR student, injectivity must hold at the frame level: each noisy frame must map to a unique clean frame under the PF‑ODE of the AR teacher. Existing methods violate this condition because they use a bidirectional teacher to generate training pairs for the AR student. Consequently, the same noisy frame can be paired with multiple possible clean frames, forcing the student to learn a conditional expectation rather than the true flow map, which manifests as blurred and inconsistent video outputs. The subsequent DMD stage cannot repair this deficiency, as demonstrated by experiments that isolate the architectural gap.
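The conditional-expectation failure mode described above can be reproduced in a few lines. In this hedged toy example (mine, not the paper's), the same "noisy frame" is paired with two distinct "clean frames"; an L2-trained student then converges to the average of the targets rather than either valid output, which is the regression analogue of the blurred frames:

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-injective pairing: one noisy input x is paired with two clean targets.
x = np.ones((1000, 1))                         # identical noisy input, repeated
y = rng.choice([-1.0, 1.0], size=(1000, 1))    # two sharp, equally likely modes

# Train a one-parameter "student" f(x) = w * x with L2 loss via gradient descent.
w = 0.0
for _ in range(500):
    grad = 2 * np.mean((w * x - y) * x)
    w -= 0.1 * grad

# w lands near the conditional mean E[y | x] ~ 0: neither clean target,
# but their blurred average.
print(f"student output: {w:.2f}")
```

With injective pairs (a unique clean target per noisy input) the same L2 objective recovers the target exactly, which is why frame-level injectivity matters for ODE distillation.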
To resolve the issue, the paper proposes Causal Forcing, a method that first trains an AR diffusion model (the AR teacher) using teacher‑forcing (i.e., conditioning on clean past frames). The authors theoretically show that teacher‑forcing yields a PF‑ODE that satisfies frame‑level injectivity, whereas diffusion‑forcing does not. Once the AR teacher is obtained, its PF‑ODE trajectories are sampled to create injective (noisy frame, clean frame) pairs. These pairs are then used for ODE distillation, ensuring that the AR student learns the exact flow map of the AR teacher. After this initialization, a standard DMD fine‑tuning stage is applied, but now it only needs to close the sampling‑step gap, not the architectural one.
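The pair-generation step can be sketched as follows. This is a minimal toy, not the paper's implementation: frames are scalars, and `toy_velocity` is a stand-in for the AR teacher's learned velocity field. The point is the structure: each frame's PF-ODE is solved deterministically conditioned on the clean past (teacher-forcing), so each noise sample maps to exactly one clean frame:

```python
import numpy as np

def toy_velocity(x, t, past):
    # Hypothetical stand-in for the AR teacher's velocity field (t unused
    # in this toy): a contraction toward a target that depends on the
    # clean past, so the resulting flow is deterministic.
    target = 0.5 * (past[-1] if past else 0.0)
    return target - x

def solve_pf_ode(x_T, past, steps=100):
    # Euler integration of dx/dt = v(x, t | clean past): a deterministic,
    # hence injective, map from noisy frame x_T to a clean frame x_0.
    x, dt = x_T, 1.0 / steps
    for i in range(steps):
        x = x + dt * toy_velocity(x, 1.0 - i * dt, past)
    return x

rng = np.random.default_rng(0)
clean_past, pairs = [], []
for frame in range(4):
    x_T = rng.standard_normal()           # fresh per-frame noise
    x_0 = solve_pf_ode(x_T, clean_past)   # unique clean frame for this noise
    pairs.append((x_T, x_0))              # injective (noisy, clean) training pair
    clean_past.append(x_0)                # teacher-forcing: condition on clean frames
```

Under the same clean-past conditioning, distinct noisy frames map to distinct clean frames, giving the student one-to-one targets; the paper's DMD stage then only has to close the sampling-step gap.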
Comprehensive experiments compare Causal Forcing against a wide range of baselines, including bidirectional and autoregressive models, as well as recent distillation techniques (CausVid, Self‑Forcing, Standard DMD, etc.). Across all metrics—Dynamic Degree, VisionReward, and Instruction Following—Causal Forcing outperforms the previous best Self‑Forcing by 19.3%, 8.7%, and 16.7% respectively, while maintaining the same inference latency (≈30 ms per frame). Qualitative results show sharper frames, better temporal coherence, and higher fidelity to user instructions.
In summary, the paper makes three key contributions: (1) it pinpoints frame‑level injectivity as a necessary condition for ODE‑based distillation of AR video diffusion models; (2) it demonstrates that existing AR distillation pipelines violate this condition, leading to suboptimal performance; and (3) it introduces Causal Forcing, which bridges the architectural gap by using an AR teacher that naturally satisfies injectivity, thereby enabling high‑quality, low‑latency interactive video generation. The work opens avenues for further research on multimodal conditioning, higher‑resolution generation, and more efficient teacher‑student architectures in the video diffusion domain.