Information Filtering via Variational Regularization for Robot Manipulation
Diffusion-based visuomotor policies built on 3D visual representations have achieved strong performance in learning complex robotic skills. However, most existing methods employ an oversized denoising decoder. While increasing model capacity can improve denoising, empirical evidence suggests that it also introduces redundancy and noise in intermediate feature blocks. Crucially, we find that randomly masking backbone features at inference time (without changing training) can improve performance, confirming the presence of task-irrelevant noise in intermediate features. To address this, we propose Variational Regularization (VR), a lightweight module that imposes a timestep-conditioned Gaussian over backbone features and applies a KL-divergence regularizer, forming an adaptive information bottleneck. Extensive experiments on three simulation benchmarks (RoboTwin2.0, Adroit, and MetaWorld) show that, compared to the baseline DP3, our approach improves the success rate by 6.1% on RoboTwin2.0 and by 4.1% on Adroit and MetaWorld, achieving new state-of-the-art results. Real-world experiments further demonstrate that our method performs well in practical deployments. Code will be released.
💡 Research Summary
The paper investigates a hidden inefficiency in state‑of‑the‑art diffusion‑based visuomotor policies for robot manipulation, specifically the 3D Diffusion Policy (DP3). DP3 combines a lightweight point‑cloud encoder that produces a compact 64‑dimensional scene embedding with a massive U‑Net denoising decoder (≈250 M parameters). While the large decoder improves raw denoising capacity, the authors hypothesize that its intermediate backbone features become redundant and noisy, especially given the already concise conditioning signal.
To test this hypothesis, they conduct a series of inference‑time masking experiments. By randomly dropping backbone features (both channel‑wise and point‑wise) at various probabilities, they observe that performance often improves when a certain fraction of the features is masked, indicating that many of those activations carry task‑irrelevant noise. Skip‑connection features show a more nuanced effect, suggesting that the bottleneck lies primarily in the deep backbone after down‑sampling.
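The masking diagnostic described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's code: the feature shape, masking function, and probability values are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(features: np.ndarray, p: float, axis: int) -> np.ndarray:
    """Zero out a random fraction p of slices along `axis`
    (axis=0 ~ channel-wise, axis=1 ~ point-wise masking)."""
    keep = rng.random(features.shape[axis]) >= p  # True = slice survives
    shape = [1] * features.ndim
    shape[axis] = features.shape[axis]
    return features * keep.reshape(shape)

# Hypothetical backbone feature map: 64 channels x 32 points.
z = rng.standard_normal((64, 32))
z_channels_dropped = random_mask(z, p=0.3, axis=0)  # channel-wise
z_points_dropped = random_mask(z, p=0.3, axis=1)    # point-wise
```

In the paper's experiments, the sweep runs over several masking probabilities at inference time only, and the success rate is re-measured for each setting.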
Motivated by these findings, the authors propose Variational Regularization (VR), a lightweight module inserted immediately after the final down‑sampling stage of the U‑Net. VR treats the backbone feature tensor Z as the input to a conditional Gaussian distribution whose mean μθ(Z, t) and variance σθ²(Z, t) are predicted by a small network conditioned on both Z and the diffusion timestep t. Using the re‑parameterization trick, a filtered representation Ẑ is sampled and fed into the subsequent up‑sampling blocks. During training, a KL‑divergence term KL(pθ(Ẑ | Z, t) ‖ N(0, I)) is added to the standard denoising loss, forming an adaptive information bottleneck. This formulation aligns with the Variational Information Bottleneck (VIB) framework: the KL term penalizes mutual information between Z and Ẑ, encouraging compression of irrelevant variations while preserving task‑relevant signals. Because the diffusion process injects different noise levels at each timestep, conditioning on t allows the bottleneck to adapt its strength dynamically.
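The VR step above (reparameterized sampling plus the closed-form Gaussian KL) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the learned mean and log-variance heads are replaced by hypothetical affine stand-ins, and the KL uses the standard closed form KL(N(μ, σ²) ‖ N(0, I)) = ½ Σ (μ² + σ² − log σ² − 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_bottleneck(z: np.ndarray, t: float):
    """Filter backbone features z via a timestep-conditioned Gaussian.
    Returns the sampled feature z_hat and the KL penalty."""
    # Stand-ins for the small learned heads μθ(Z, t) and log σθ²(Z, t);
    # these affine maps are placeholders, not the paper's architecture.
    mu = 0.9 * z + 0.01 * t
    log_var = np.full_like(z, -2.0) + 0.001 * t
    sigma = np.exp(0.5 * log_var)

    # Re-parameterization trick: z_hat = μ + σ ⊙ ε, with ε ~ N(0, I)
    eps = rng.standard_normal(z.shape)
    z_hat = mu + sigma * eps

    # Closed-form KL(N(μ, σ²) ‖ N(0, I)), summed over all feature entries
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)
    return z_hat, kl

z = rng.standard_normal((64, 32))   # hypothetical backbone feature tensor
z_hat, kl = variational_bottleneck(z, t=10)
# Training objective (schematic): loss = denoising_loss + beta * kl
```

The KL term is what enforces the bottleneck: it is minimized (zero) only when the conditional Gaussian collapses to the prior N(0, I), so the network pays a cost for transmitting information from Z into Ẑ and learns to keep only task-relevant signal.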
Extensive experiments are performed on three simulation suites—RoboTwin2.0, Adroit, and MetaWorld—as well as on real‑world robot setups. Compared to the baseline DP3, VR yields a 6.1 % absolute increase in success rate on RoboTwin2.0 and a 4.1 % increase on both Adroit and MetaWorld, establishing new state‑of‑the‑art results. Ablation studies confirm that (1) the performance gain stems from the KL regularizer rather than merely from additional parameters, (2) timestep conditioning is essential for optimal filtering, and (3) VR’s effect is robust across different masking schemes and task categories. The module adds only a few thousand parameters and negligible compute overhead, making it a plug‑and‑play enhancement for existing diffusion policies.
The paper contributes (i) empirical evidence that large diffusion decoders can harbor task‑irrelevant noise, (ii) a principled, lightweight variational bottleneck that adaptively filters this noise, and (iii) a thorough theoretical connection to VIB and ELBO maximization. Limitations include the focus on a single bottleneck location; future work could explore multi‑layer regularization, alternative posterior families (e.g., mixture models), or integration with encoder‑side information bottlenecks. Overall, the work demonstrates that smarter information flow control, rather than sheer model size, is key to advancing diffusion‑based robot manipulation policies.