Draft-and-Target Sampling for Video Generation Policy
Video generation models have been used as robot policies that predict the future states of a task conditioned on a task description and an observation. Prior work, however, largely overlooks their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel, training-free diffusion inference paradigm for video generation policies that improves inference efficiency. We introduce a self-play denoising approach that utilizes two complementary denoising trajectories within a single model: draft sampling takes large steps to quickly generate a global trajectory, and target sampling takes small steps to verify it. To further speed up generation, we introduce token chunking and a progressive acceptance strategy that reduce redundant computation. Experiments on three benchmarks show that our method achieves up to a 2.1x speedup and improves the efficiency of current state-of-the-art methods with minimal compromise to the success rate. Our code is available.
💡 Research Summary
The paper addresses a critical bottleneck in robot control pipelines that rely on video generation policies (VGPs). A VGP consists of a diffusion‑based video generator that predicts future visual states conditioned on the current observation and a task description, followed by an action predictor that extracts executable actions from the generated frames. While recent works have demonstrated impressive predictive quality, the inference time of diffusion models—often requiring hundreds of denoising steps—remains prohibitive for real‑time embodied agents.
Inspired by speculative decoding (SD) in large language models, the authors propose Draft‑and‑Target Sampling (DTS), a training‑free diffusion inference paradigm that eliminates the need for a separate, lightweight draft model. Instead, a single video diffusion model is used in two complementary sampling modes:
- Draft Sampling – a coarse, fast pass that takes large steps (step size $n_1$) along the DDIM schedule, producing a sparse "draft" denoising trajectory $D = \{x_T, x_{T-n_1}, \dots, x_0\}$. This quickly approximates the full trajectory but accumulates larger errors due to the aggressive skipping of intermediate steps.
- Target Sampling – a fine, accurate pass that revisits each draft token in parallel, applying many small steps (step size $n_2 < n_1$) to refine it into a "target" token sequence $\bar{x}$. The result is a dense trajectory $G_{all}$ that closely matches what would be obtained by the full diffusion process.
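Concretely, the two passes traverse the same DDIM schedule with different strides. A minimal sketch of the two step schedules, using illustrative values for $T$, $n_1$, and $n_2$ (the paper treats the exact step sizes as heuristic choices):

```python
def step_schedule(T: int, step: int) -> list[int]:
    """Timesteps visited when denoising from T down to 0 with a fixed stride."""
    ts = list(range(T, 0, -step))
    if ts[-1] != 0:
        ts.append(0)  # always finish at the fully denoised state
    return ts

# illustrative values, not the paper's: T = 1000, n1 = 250, n2 = 50
T, n1, n2 = 1000, 250, 50
draft = step_schedule(T, n1)    # sparse "draft" trajectory: few, large strides
target = step_schedule(T, n2)   # dense "target" trajectory: many small strides

# the draft pass is cheaper because it needs far fewer model evaluations
assert len(draft) < len(target)
print(draft)   # [1000, 750, 500, 250, 0]
```

Because each draft token can be refined independently, the target pass over all draft tokens can run in parallel rather than sequentially.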
After both trajectories are generated, a verification phase compares each draft token with its corresponding target token. If they match (or fall within an acceptance bound), the draft token is accepted; otherwise, the algorithm restarts draft sampling from the first rejected position using the target token as a new seed. This self‑play denoising mechanism ensures that the final video frames retain high fidelity while most of the computation is performed in the cheap draft pass.
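The verify-then-restart loop described above can be sketched as follows; the function names, the distance metric, and the acceptance bound `tau` are illustrative assumptions rather than the paper's exact formulation:

```python
def verify_and_resample(draft_fn, target_fn, dist, tau, max_rounds=100):
    """Self-play verification sketch (hypothetical interface):
    draft_fn(seed)    -> list of draft tokens drafted from `seed`,
    target_fn(tokens) -> refined target token for each draft token,
    dist(d, t)        -> deviation between a draft/target pair,
    tau               -> acceptance bound."""
    accepted, seed = [], None
    for _ in range(max_rounds):
        drafts = draft_fn(seed)
        targets = target_fn(drafts)
        # find the first draft token that deviates too far from its target
        reject = next(
            (i for i, (d, t) in enumerate(zip(drafts, targets))
             if dist(d, t) > tau),
            None,
        )
        if reject is None:          # every draft token verified: done
            return accepted + targets
        # keep the verified prefix, then restart drafting from the first
        # rejected position, using the target token as the new seed
        accepted += targets[:reject]
        seed = targets[reject]
    return accepted

# toy usage: drafts deviate only slightly, so everything is accepted
refined = verify_and_resample(
    draft_fn=lambda seed: [0.0, 1.0, 2.0],
    target_fn=lambda ds: [d + 0.1 for d in ds],
    dist=lambda d, t: abs(d - t),
    tau=0.5,
)
```

In the accept-everything case above, a single draft pass plus one parallel target pass suffices; rejections cost one extra round each, which is why bounding their frequency matters.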
To overcome two practical challenges—(a) GPU compute saturation caused by processing long draft sequences in a single batch, and (b) the high dimensionality and diversity of denoising‑trajectory tokens—the authors introduce two orthogonal strategies:
- Token Chunking – The draft trajectory is split into manageable chunks (e.g., 8‑16 frames). Each chunk undergoes draft‑target verification independently, allowing the GPU to process smaller batches, reducing memory pressure, and avoiding wasteful refinement of tokens that will later be rejected.
- Progressive Acceptance – Instead of demanding exact equality between draft and target tokens (which is unrealistic for continuous video tokens), the acceptance threshold is gradually tightened across chunks. Early chunks tolerate larger deviations, dramatically cutting the number of resampling events; later chunks enforce stricter similarity, guaranteeing final quality.
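The two strategies compose naturally: split the trajectory into chunks, then assign each chunk its own acceptance bound. A sketch under illustrative assumptions (chunk size 8 and a linear threshold schedule from 0.5 down to 0.1; the paper does not fix these values):

```python
def chunks(seq, size):
    """Split a token sequence into consecutive chunks of at most `size`."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def progressive_thresholds(n_chunks, tau_start=0.5, tau_end=0.1):
    """Linearly tighten the acceptance bound from early (loose) chunks
    to late (strict) chunks; the linear schedule is an assumption."""
    if n_chunks == 1:
        return [tau_end]
    step = (tau_start - tau_end) / (n_chunks - 1)
    return [tau_start - i * step for i in range(n_chunks)]

tokens = list(range(32))                  # stand-in for a draft trajectory
parts = chunks(tokens, 8)                 # 4 chunks of 8 tokens each
taus = progressive_thresholds(len(parts))
# each chunk is verified independently against its own bound:
# early chunks tolerate larger deviations, later chunks are strict
```

Verifying per chunk also means a rejection only triggers resampling within that chunk, not across the whole trajectory.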
The authors evaluate DTS on three established robotic simulation benchmarks:
- iThor (interactive indoor navigation and manipulation),
- Meta‑World (diverse manipulation tasks),
- Libero (complex multi‑step procedural tasks).
They compare against state‑of‑the‑art diffusion‑based policies (DDPM/DDIM solvers) and a recent speculative decoding variant for visual‑language‑action models (Spec‑VLA). Results show:
- Speedup – An average 2.1× speedup in inference time, with the highest gain of 2.4× on iThor.
- Success Rate Impact – Minimal degradation: ≤ 0.3% drop on iThor, ≤ 1.2% on Meta‑World and Libero, indicating that the coarse‑fine refinement preserves task performance.
- Compute Efficiency – Token chunking cuts GPU memory usage by 30‑45%, and progressive acceptance reduces the proportion of tokens that require costly resampling by over 40%.
The paper also discusses limitations. The choice of step sizes $n_1$ and $n_2$ is currently heuristic and may need task‑specific tuning. There is no formal bound on error accumulation, so pathological cases could trigger many resampling cycles. Real‑time deployment on physical robots would require tighter integration with robot control loops and possibly further hardware‑aware optimizations.
In conclusion, Draft‑and‑Target Sampling provides a novel, training‑free method to accelerate diffusion‑based video generation policies. By leveraging a single model in two complementary sampling regimes, and by introducing token chunking plus progressive acceptance, the approach achieves substantial speed gains while keeping success rates virtually unchanged. The techniques are broadly applicable to any generative task that operates on high‑dimensional continuous tokens, suggesting future extensions to 3‑D scene synthesis, physics simulation, or multimodal generation pipelines.