Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory usage, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention compute and memory, and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5–10× end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.


💡 Research Summary

This paper tackles two fundamental bottlenecks that arise when using autoregressive video diffusion models for streaming, long‑horizon generation: (1) the ever‑increasing latency caused by the linear growth of the key‑value (KV) cache, and (2) the exploding GPU memory consumption that forces practitioners to limit temporal context windows, thereby harming long‑range consistency. The authors first conduct a systematic analysis of redundancy in autoregressive diffusion. They identify three persistent sources of waste: (i) near‑duplicate keys across successive frames, (ii) slowly evolving, largely semantic queries and keys that make many dot‑product computations redundant, and (iii) cross‑attention over long textual prompts where only a small subset of tokens is relevant for any given frame.

Based on these observations, they propose a unified, training‑free attention framework composed of three modules, all built around fast approximate nearest‑neighbor (ANN) search:

  1. TempCache compresses the KV cache by exploiting temporal correspondence. For each new frame, queries are matched to keys from previous frames using lightweight ANN (LSH or quantized similarity). Highly similar key‑value pairs are merged, bounding cache growth to a fixed size and keeping peak memory roughly constant throughout a rollout.

  2. AnnSA sparsifies self‑attention. Instead of attending to every cached key, each query retrieves a small candidate set of keys (e.g., the top 30% by similarity) via ANN. Attention is then computed only on this reduced set using sparse kernels such as FlashInfer. Empirically, this retains >85% of the attention mass while cutting compute dramatically.

  3. AnnCA sparsifies cross‑attention. Long prompts are pruned per‑frame: a query‑driven ANN search selects the most relevant prompt tokens, discarding the rest before the softmax. This reduces the cross‑attention matrix size without noticeable quality loss, as most frames only need a handful of prompt tokens.
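The sparse-attention idea behind AnnSA can be illustrated with a toy NumPy sketch for a single query: score all cached keys, keep only the top fraction, and run softmax on that subset. Exact top-k selection stands in here for the paper's fast ANN search, and `keep_frac=0.3` mirrors the "top 30%" figure above; the function name and shapes are illustrative, not from the paper.

```python
import numpy as np

def topk_sparse_attention(q, K, V, keep_frac=0.3):
    """Illustrative AnnSA-style sparse attention for one query vector.
    Instead of attending to all cached keys, keep only the top
    `keep_frac` fraction by similarity and softmax over that subset.
    (The paper uses fast ANN search for this step; exact top-k is
    used here purely for clarity.)"""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)              # similarity to every cached key
    k = max(1, int(np.ceil(keep_frac * len(K))))
    idx = np.argpartition(scores, -k)[-k:]   # candidate set (unsorted top-k)
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                             # softmax over the reduced set only
    return w @ V[idx], idx

rng = np.random.default_rng(0)
K = rng.standard_normal((100, 64))           # 100 cached keys, head dim 64
V = rng.standard_normal((100, 64))
q = rng.standard_normal(64)
out, idx = topk_sparse_attention(q, K, V)
print(out.shape, len(idx))                   # (64,) 30
```

The same selection logic applies to AnnCA, with the cached keys replaced by prompt-token keys; a production kernel would batch this over all queries and dispatch the reduced set to a sparse attention kernel.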

All three components are training‑free: they operate on top of pre‑trained diffusion transformers (e.g., DiT‑based video diffusion) without any fine‑tuning or additional parameters. The ANN mechanisms are deliberately lightweight—LSH hashing or product quantization—so that the overhead of candidate selection is negligible compared with the savings in dot‑product calculations.
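To make the "lightweight ANN" step concrete, here is a minimal random-hyperplane LSH sketch: each key is hashed to a bucket by its sign pattern against a few random planes, and a query only scores keys in its own bucket. The helper name and plane count are illustrative assumptions, not details from the paper.

```python
import numpy as np

def lsh_buckets(X, planes):
    """Random-hyperplane LSH: the sign pattern of each row of X
    against `planes` is packed into an integer bucket code."""
    bits = (X @ planes.T) > 0                        # (n, n_planes) sign bits
    return bits @ (1 << np.arange(planes.shape[0]))  # pack bits into an int

rng = np.random.default_rng(0)
planes = rng.standard_normal((8, 64))   # 8 hyperplanes -> up to 256 buckets
keys = rng.standard_normal((1000, 64))
codes = lsh_buckets(keys, planes)

q = rng.standard_normal(64)
q_code = lsh_buckets(q[None, :], planes)[0]
candidates = np.flatnonzero(codes == q_code)  # only same-bucket keys are scored
print(codes.shape, len(candidates))
```

Hashing costs one small matrix multiply per frame, which is why candidate selection stays negligible next to the full dot-product attention it replaces.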

The authors evaluate the system on a single NVIDIA H100 GPU. For a 2‑minute video (≈3000 frames), the baseline autoregressive diffusion takes about 11 minutes, while the proposed pipeline finishes in roughly 2 minutes, yielding a 5–10× speedup. Quality metrics (PSNR, SSIM, FID) show less than 0.1% degradation relative to the unmodified model, and visual inspection confirms that temporal coherence and fine‑grained motion are preserved. Memory usage remains flat across the entire rollout, contrasting sharply with the baseline where memory grows linearly with frame count. The same gains are demonstrated on autoregressive video world‑model tasks, confirming the method’s generality.

In summary, the paper introduces a practical, inference‑only solution that simultaneously curtails latency and memory growth in autoregressive video diffusion. By treating attention as an ANN problem and compressing the KV cache through temporal correspondence, the authors achieve near‑identical visual quality while delivering order‑of‑magnitude speedups. The approach is broadly applicable to other sequence‑based generative models, opening avenues for real‑time, long‑form video synthesis, controllable world‑model generation, and interactive neural game engines.
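The cache-compression idea summarized above can be sketched as a toy TempCache-style merge: when a new frame's keys are nearly identical to cached ones, average the pair in place instead of appending, so redundant content stops growing the cache. Exact cosine similarity stands in for the paper's ANN matching, and the 0.98 threshold is a made-up illustration.

```python
import numpy as np

def merge_into_cache(cache_k, cache_v, new_k, new_v, thresh=0.98):
    """Toy TempCache-style KV compression: near-duplicate keys
    (cosine similarity >= thresh) are merged into the existing entry
    rather than appended, bounding cache growth for redundant frames.
    Exact similarity stands in for the paper's ANN matching."""
    for k, v in zip(new_k, new_v):
        sims = (cache_k @ k) / (
            np.linalg.norm(cache_k, axis=1) * np.linalg.norm(k) + 1e-9)
        j = int(np.argmax(sims))
        if sims[j] >= thresh:                 # near-duplicate: merge in place
            cache_k[j] = 0.5 * (cache_k[j] + k)
            cache_v[j] = 0.5 * (cache_v[j] + v)
        else:                                 # genuinely new content: append
            cache_k = np.vstack([cache_k, k])
            cache_v = np.vstack([cache_v, v])
    return cache_k, cache_v

rng = np.random.default_rng(0)
cache_k = rng.standard_normal((50, 64))
cache_v = rng.standard_normal((50, 64))
# A "new frame" whose keys are tiny perturbations of cached ones:
new_k = cache_k[:10] + 1e-3 * rng.standard_normal((10, 64))
new_v = cache_v[:10].copy()
cache_k, cache_v = merge_into_cache(cache_k, cache_v, new_k, new_v)
print(cache_k.shape)   # stays (50, 64): all ten keys merged, none appended
```

A real implementation would batch the matching with the same ANN index used for sparse attention; the point here is only that merging keeps peak memory roughly constant across a rollout.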

