Self-Imitated Diffusion Policy for Efficient and Robust Visual Navigation
Diffusion policies (DP) have demonstrated significant potential in visual navigation by capturing diverse multi-modal trajectory distributions. However, standard imitation learning (IL), which most DP methods rely on for training, often inherits sub-optimality and redundancy from expert demonstrations, necessitating a computationally intensive “generate-then-filter” pipeline that relies on auxiliary selectors during inference. To address these challenges, we propose Self-Imitated Diffusion Policy (SIDP), a novel framework that learns improved planning by selectively imitating trajectories sampled from itself. Specifically, SIDP introduces a reward-guided self-imitation mechanism that encourages the policy to consistently produce high-quality trajectories rather than outputs of inconsistent quality, thereby reducing reliance on extensive sampling and post-filtering. During training, we employ a reward-driven curriculum learning paradigm to improve data utilization, and goal-agnostic exploration for trajectory augmentation to improve planning robustness. Extensive evaluations on a comprehensive simulation benchmark show that SIDP significantly outperforms previous methods, with real-world experiments confirming its effectiveness across multiple robotic platforms. On a Jetson Orin Nano, SIDP delivers 2.5$\times$ faster inference than the baseline NavDP (110 ms vs. 273 ms), enabling efficient real-time deployment.
💡 Research Summary
The paper addresses two major drawbacks of current diffusion‑based visual navigation policies that rely on imitation learning (IL) from expert demonstrations: (1) the inherited sub‑optimality and redundancy of the demonstration set, which limits robustness under distribution shift, and (2) the need for a “generate‑then‑filter” inference pipeline that samples many trajectories and uses an auxiliary selector to pick a high‑quality plan, incurring substantial latency on resource‑constrained platforms.
To overcome these issues, the authors propose the Self‑Imitated Diffusion Policy (SIDP), a framework in which the policy learns from its own high‑reward trajectories rather than from external experts. The core mechanism is a reward‑guided self‑imitation loop: at each training step the current diffusion policy πθ samples N candidate trajectories, each of which is evaluated by a composite reward function r(s,a) that combines safety (collision penalty), efficiency (step cost), progress toward the goal, and a docking term. The top‑k trajectories with the highest rewards are retained, and importance weights are computed as a Boltzmann distribution w_i = exp(r_i/τ) normalized over the selected set. These weights correspond exactly to the optimal distribution p∗(a|s) ∝ πθ(a|s)·exp(r/τ) derived from a KL‑constrained reinforcement‑learning formulation (REPS). Consequently, maximizing the weighted log‑likelihood of the selected trajectories is equivalent to minimizing a reward‑weighted denoising loss L_SIDP, which can be back‑propagated through the diffusion denoising network without the need for back‑propagation through time (BPTT).
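The selection-and-weighting loop above can be sketched in a few lines. This is an illustrative stand-in, not the paper's implementation: `composite_reward` collapses the safety and docking terms into a toy goal-progress-minus-step-cost score, the trajectories are random arrays rather than diffusion samples, and the final weighted MSE on noise predictions only gestures at the reward-weighted denoising loss L_SIDP.

```python
import numpy as np

rng = np.random.default_rng(0)

def composite_reward(traj, goal):
    """Toy stand-in for r(s, a): progress toward the goal minus a
    per-step cost (safety and docking terms omitted for brevity)."""
    progress = -np.linalg.norm(traj[-1] - goal)  # closer endpoint -> higher reward
    step_cost = -0.01 * len(traj)
    return progress + step_cost

def boltzmann_weights(rewards, tau=1.0):
    """Importance weights w_i = exp(r_i / tau), normalized over the
    selected set (a temperature softmax, shifted for numerical stability)."""
    z = np.exp((rewards - rewards.max()) / tau)
    return z / z.sum()

def self_imitation_step(trajs, goal, k=4, tau=1.0):
    """Keep the top-k trajectories by reward; return them with weights."""
    rewards = np.array([composite_reward(t, goal) for t in trajs])
    top = np.argsort(rewards)[-k:]  # indices of the k highest rewards
    return [trajs[i] for i in top], boltzmann_weights(rewards[top], tau)

# N = 16 candidate trajectories of 8 waypoints in 2-D, here just random walks.
goal = np.array([3.0, 0.0])
candidates = [np.cumsum(rng.normal(0.0, 0.5, (8, 2)), axis=0) for _ in range(16)]
selected, weights = self_imitation_step(candidates, goal, k=4, tau=0.5)

# Reward-weighted denoising loss: per-trajectory MSE between predicted and
# true noise, combined with the Boltzmann weights (a stand-in for L_SIDP).
eps_true = rng.normal(size=(4, 8, 2))
eps_pred = eps_true + 0.1 * rng.normal(size=(4, 8, 2))
loss = float(np.sum(weights * np.mean((eps_pred - eps_true) ** 2, axis=(1, 2))))
```

Because the weights are a normalized softmax over the retained set, trajectories with similar rewards contribute almost equally, while a clearly superior sample dominates the gradient, which is what concentrates the policy on high-quality plans.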
Two complementary strategies are introduced to improve data efficiency and generalization. First, goal‑agnostic exploration augments training with randomly sampled auxiliary goals (uniform angular range ±60°, distance 3–5 m). The policy generates feasible paths toward these goals while the goal input is replaced by a special embedding and the importance weights are set uniformly. This decouples the policy’s behavior from any specific goal, diversifies the trajectory pool, and regularizes point‑goal navigation. Second, reward‑driven curriculum learning dynamically selects training scenarios based on two metrics: the maximum reward R_max and the reward range R_range of the sampled trajectories. Only scenarios satisfying thresholds τ_max and τ_range are used, ensuring that the policy focuses on informative, learnable experiences and avoids noisy, low‑reward samples that could destabilize training.
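Both strategies reduce to simple sampling and filtering rules; a minimal sketch follows. The angular and distance ranges for auxiliary goals come from the summary above, while the threshold values `tau_max` and `tau_range` are hypothetical placeholders (the paper's actual thresholds are not given here).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_auxiliary_goal():
    """Goal-agnostic exploration: draw an auxiliary goal with heading
    uniform in +/-60 degrees and distance uniform in 3-5 m."""
    theta = rng.uniform(-np.pi / 3, np.pi / 3)
    dist = rng.uniform(3.0, 5.0)
    return np.array([dist * np.cos(theta), dist * np.sin(theta)])

def curriculum_keep(rewards, tau_max=-1.0, tau_range=0.2):
    """Reward-driven curriculum: keep a scenario only if its best sampled
    trajectory is good enough (R_max >= tau_max) and the reward spread is
    informative (R_range >= tau_range). Threshold values are hypothetical."""
    r_max = float(rewards.max())
    r_range = float(rewards.max() - rewards.min())
    return r_max >= tau_max and r_range >= tau_range

aux_goal = sample_auxiliary_goal()
uniform_w = np.full(8, 1.0 / 8)  # goal-agnostic samples get uniform weights

# Example scenario: mixed-quality rewards -> informative, so it is kept.
scenario_rewards = np.array([-0.5, -0.3, 0.1, 0.4])
keep = curriculum_keep(scenario_rewards)
```

The range criterion is the interesting part: a scenario where every sampled trajectory scores nearly the same (all trivially good or all hopeless) carries little gradient signal, so it is skipped in favor of scenarios the policy can actually learn from.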
Because SIDP concentrates the trajectory distribution around high‑quality samples, the inference pipeline no longer requires dense sampling and a post‑hoc selector. A single deterministic diffusion pass (or a small number of denoising steps) yields a viable plan, dramatically reducing latency. On an edge device (NVIDIA Jetson Orin Nano) SIDP achieves 110 ms inference per planning query, a 2.5× speed‑up over the baseline NavDP (273 ms) while preserving navigation performance.
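The structural difference between the two inference pipelines can be made concrete with a toy sketch. Everything here is a stand-in: `denoise` fakes a diffusion rollout with a noisy straight-line integrator, and the selector is a hypothetical goal-distance scorer, but the rollout counts (n versus 1 per query) are what drive the latency gap reported above.

```python
import numpy as np

rng = np.random.default_rng(2)

def denoise(obs, goal, noise):
    """Stand-in for one diffusion rollout: noisy steps toward the goal."""
    direction = (goal - obs) / np.linalg.norm(goal - obs)
    return obs + np.cumsum(0.2 * direction + 0.05 * noise, axis=0)

def generate_then_filter(obs, goal, selector, n=16):
    """Baseline pipeline: sample n trajectories, score each with an
    auxiliary selector, keep the best -- n rollouts per planning query."""
    cands = [denoise(obs, goal, rng.normal(size=(8, 2))) for _ in range(n)]
    return max(cands, key=selector), n

def single_pass(obs, goal):
    """SIDP-style inference: one rollout, no selector -- 1 rollout per query."""
    return denoise(obs, goal, rng.normal(size=(8, 2))), 1

obs = np.zeros(2)
goal = np.array([3.0, 0.0])
selector = lambda t: -np.linalg.norm(t[-1] - goal)  # hypothetical scorer
best, rollouts_baseline = generate_then_filter(obs, goal, selector)
plan, rollouts_sidp = single_pass(obs, goal)
```

Since self-imitation training already concentrates probability mass on high-reward trajectories, the single rollout is a good plan with high probability, which is why the selector (and its latency and memory cost) can be dropped at deployment time.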
Extensive evaluation is performed on the InternVLA‑N1 S1 benchmark and the more challenging InternScene‑Commercial set. SIDP surpasses NavDP by roughly 10 percentage points in both Success Rate (SR) and Success weighted by Path Length (SPL), demonstrating superior robustness to the sim‑to‑real domain gap. Real‑world experiments on two distinct robotic platforms (a mobile base and a manipulator‑mounted robot) confirm that SIDP can reliably avoid obstacles, reach goals, and execute docking maneuvers without auxiliary filtering. The self‑generated trajectories are continuously re‑used for imitation, improving data efficiency and eliminating dependence on external expert datasets.
In summary, SIDP contributes (1) a self‑imitation learning paradigm that aligns diffusion policies with an optimal reward‑conditioned distribution, (2) a practical curriculum and exploration scheme that enhances sample diversity and training stability, and (3) a streamlined inference architecture that enables real‑time, robust visual navigation on low‑power hardware. The work represents a significant step toward deploying diffusion‑based planners in practical robotics applications.