Less Is More: Scalable Visual Navigation from Limited Data


Imitation learning provides a powerful framework for goal-conditioned visual navigation in mobile robots, enabling obstacle avoidance while respecting human preferences and social norms. However, its effectiveness depends critically on the quality and diversity of training data. In this work, we show how classical geometric planners can be leveraged to generate synthetic trajectories that complement costly human demonstrations. We train Less is More (LiMo), a transformer-based visual navigation policy that predicts goal-conditioned SE(2) trajectories from a single RGB observation, and find that augmenting limited expert demonstrations with planner-generated supervision yields substantial performance gains. Through ablations and complementary qualitative and quantitative analyses, we characterize how dataset scale and diversity affect planning performance. We demonstrate real-robot deployment and argue that robust visual navigation is enabled not by simply collecting more demonstrations, but by strategically curating diverse, high-quality datasets. Our results suggest that scalable, embodiment-specific geometric supervision is a practical path toward data-efficient visual navigation.


💡 Research Summary

The paper tackles a fundamental bottleneck in learning‑based visual navigation for mobile robots: the scarcity of high‑quality human‑demonstrated data. While classical geometric pipelines (mapping, traversability estimation, path planning) excel at collision avoidance, they lack the ability to encode semantic preferences, social norms, and nuanced terrain understanding. Recent imitation‑learning (IL) approaches such as ViNT, NoMaD, and FlowNav have shown promise by training transformer‑based policies on large, multi‑embodiment datasets, but the collection of expert demonstrations remains costly and often limited to a few hours of operation.

To bridge this gap, the authors propose a two‑pronged data‑augmentation strategy built on top of the GrandTour dataset, a high‑quality collection of teleoperated ANYmal quadruped missions. First, they extract a “teleop” subset (D_TEL) consisting of short, goal‑conditioned trajectory segments paired with front‑camera RGB images. This subset preserves human preferences, social‑aware behaviors, and embodiment‑specific motion characteristics. Second, they generate a synthetic “geometric” subset (D_GEO) by running a Model‑Predictive Path Integral (MPPI) planner on the same RGB‑elevation pairs. For each frame they sample K random goal poses from a Gaussian distribution in the robot‑centric frame, compute a traversability map using a CNN trained for the ANYmal embodiment, derive a geodesic distance field, and construct a cost function J that balances traversability, goal distance, and control effort. MPPI then produces feasible SE(2) waypoint sequences for each sampled goal.
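The goal-sampling and cost-evaluation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the Gaussian spread `sigma_xy`, and the weights `w_trav`, `w_goal`, `w_ctrl` are all assumed placeholders, and the traversability and geodesic terms are passed in as opaque callables standing in for the CNN traversability map and the geodesic distance field.

```python
import numpy as np

def sample_goal_poses(k, sigma_xy=3.0, rng=None):
    """Sample K random SE(2) goal poses in the robot-centric frame.

    Positions are drawn from a zero-mean Gaussian (sigma_xy is an assumed
    spread, not the paper's value); headings are drawn uniformly.
    Returns an array of shape (K, 3) holding (x, y, theta).
    """
    rng = np.random.default_rng() if rng is None else rng
    xy = rng.normal(0.0, sigma_xy, size=(k, 2))
    theta = rng.uniform(-np.pi, np.pi, size=(k, 1))
    return np.hstack([xy, theta])

def trajectory_cost(traj, controls, geodesic_field, trav_cost_fn,
                    w_trav=1.0, w_goal=1.0, w_ctrl=0.1):
    """Illustrative cost J for one rolled-out SE(2) trajectory.

    J balances (i) accumulated traversability penalty along the states,
    (ii) goal distance, read from the geodesic distance field at the
    final state, and (iii) control effort. MPPI would evaluate this cost
    over many sampled rollouts and reweight them accordingly.
    """
    trav = sum(trav_cost_fn(s) for s in traj)   # traversability penalty
    goal = geodesic_field(traj[-1])             # distance-to-goal at the end state
    ctrl = float(np.sum(np.square(controls)))   # control effort
    return w_trav * trav + w_goal * goal + w_ctrl * ctrl
```

In the full pipeline, MPPI perturbs a nominal control sequence, rolls each perturbation through the robot's dynamics, scores rollouts with a cost of this form, and returns the feasible waypoint sequence for each sampled goal.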

The union of D_TEL and D_GEO yields the augmented dataset D_AUG, which is used to train the proposed policy, LiMo (Less is More). LiMo receives a single RGB observation I and a goal pose g ∈ SE(2) and predicts a fixed‑length sequence of N waypoints s₁, …, s_N, each expressed as (x, y, θ). The architecture consists of a Vision Transformer (ViT‑B) encoder for the image, a learned embedding for the goal pose, and a transformer decoder that autoregressively generates the waypoint tokens. Training minimizes an L2 behavior‑cloning loss between predicted and ground‑truth waypoints, supplemented by a smoothness regularizer that penalizes abrupt changes between consecutive waypoints.
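The training objective described above can be sketched as a sum of the L2 behavior-cloning term and a first-difference smoothness penalty. This is a hedged sketch, not the paper's code: `lambda_smooth` is an assumed weight, and the waypoints are treated as plain (x, y, θ) vectors without any special handling of angle wrap-around.

```python
import numpy as np

def limo_loss(pred, target, lambda_smooth=0.1):
    """L2 behavior-cloning loss plus a smoothness regularizer.

    pred, target: (N, 3) arrays of SE(2) waypoints (x, y, theta).
    The first term is the mean squared error against the ground-truth
    waypoints; the second penalizes abrupt changes between consecutive
    predicted waypoints. lambda_smooth is an assumed trade-off weight.
    """
    bc = np.mean(np.sum((pred - target) ** 2, axis=-1))
    diffs = np.diff(pred, axis=0)  # s_{i+1} - s_i between consecutive waypoints
    smooth = np.mean(np.sum(diffs ** 2, axis=-1)) if len(pred) > 1 else 0.0
    return bc + lambda_smooth * smooth
```

In an actual training loop this would be computed on batched tensors with an autodiff framework; the numpy version above only illustrates the structure of the objective.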

Extensive experiments on held‑out GrandTour missions compare three training regimes: (i) D_TEL only, (ii) D_GEO only, and (iii) the full D_AUG. Performance is measured using Success weighted by Path Length (SPL). The D_AUG‑trained LiMo achieves an improvement of roughly 12% in SPL over the D_TEL‑only baseline, demonstrating that synthetic geometric supervision substantially improves both success rate and path efficiency, especially in cluttered, low‑light, or highly uneven terrain where human teleoperation data are sparse. Qualitative analyses show that LiMo learns to exploit visible structures (e.g., cutting corners when safe) and to exhibit socially compliant behaviors such as yielding to humans or avoiding narrow passages.
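For reference, the SPL metric used above has a standard definition (Anderson et al.): each episode contributes its success flag weighted by the ratio of shortest-path length to the longer of the actual and shortest path. A minimal sketch:

```python
def spl(successes, shortest, actual):
    """Success weighted by Path Length over a set of episodes.

    successes: per-episode success flags (0 or 1);
    shortest:  geodesic shortest-path length from start to goal;
    actual:    path length the agent actually traversed.
    """
    total = 0.0
    for s_i, l_i, p_i in zip(successes, shortest, actual):
        total += s_i * l_i / max(p_i, l_i)
    return total / len(successes)
```

A successful episode along the exact shortest path scores 1; detours and failures pull the average down, so SPL captures both success rate and path efficiency in one number.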

Real‑world deployment on an ANYmal robot validates LiMo as a drop‑in local planner. Compared to a pure MPPI planner, LiMo reaches goals faster, follows smoother trajectories, and respects semantic constraints learned from human demonstrations. The policy also gracefully handles unreachable goals by generating safe exploratory motions rather than failing catastrophically.

The authors acknowledge several limitations. MPPI‑generated trajectories rely on accurate elevation maps; any degradation in map quality at deployment could affect supervision fidelity. The goal‑conditioning is pose‑based rather than image‑based, limiting the ability to specify goals via visual cues. Finally, the study focuses on a single embodiment (ANYmal), leaving cross‑embodiment generalization an open question. Future work is suggested to incorporate real‑time depth estimation, multi‑embodiment training, and multimodal goal specifications (language, images).

In summary, the paper demonstrates that strategic curation of a modest amount of expert data combined with scalable geometric supervision can yield a data‑efficient, high‑performing visual navigation policy. By leveraging the complementary strengths of human intuition and classical planning, LiMo achieves robust, semantically aware navigation without the need for massive demonstration collections, offering a practical pathway toward scalable, embodiment‑specific visual navigation.

