CoFreeVLA: Collision-Free Dual-Arm Manipulation via Vision-Language-Action Model and Risk Estimation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Vision-Language-Action (VLA) models enable instruction-following manipulation, yet dual-arm deployment remains unsafe due to under-modeled self-collisions between the arms and grasped objects. We introduce CoFreeVLA, which augments an end-to-end VLA with a short-horizon self-collision risk estimator that predicts collision likelihood from proprioception, visual embeddings, and planned actions. The estimator gates risky commands, recovers to safe states via risk-guided adjustments, and shapes policy refinement for safer rollouts. It is pre-trained with model-based collision labels and post-trained on real-robot rollouts for calibration. On five bimanual tasks with the PiPER robot arm, CoFreeVLA reduces self-collisions and improves success rates versus RDT and APEX.


💡 Research Summary

CoFreeVLA addresses a critical safety gap in dual‑arm robotic manipulation that arises when Vision‑Language‑Action (VLA) models are deployed without explicit self‑collision awareness. While VLA models excel at grounding natural‑language instructions into visual and proprioceptive observations, they typically rely on external planners or simple geometric checks to avoid collisions with the environment, leaving inter‑arm and arm‑object collisions under‑modeled. CoFreeVLA augments an end‑to‑end VLA with a short‑horizon self‑collision risk estimator that predicts the probability of a collision within a few future timesteps, together with auxiliary signals such as minimum inter‑body distance and time‑to‑collision.
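As a rough illustration of the gate-and-recover behavior described above, the sketch below wraps a (stand-in) risk estimator around a candidate action chunk and rejects it when predicted risk exceeds a threshold. All names, the threshold value, and the surrogate risk function are illustrative assumptions, not details from the paper; the real estimator is a learned network.

```python
import numpy as np

# Assumed calibration point for gating; the paper's actual threshold is not given here.
RISK_THRESHOLD = 0.3

def estimate_risk(state, visual_emb, actions):
    """Stand-in for the learned short-horizon self-collision risk estimator.

    For illustration only: derives a pseudo-probability from the norm of the
    planned action chunk. The real model consumes proprioception, visual
    embeddings, and the action sequence.
    """
    return float(1.0 / (1.0 + np.exp(-np.linalg.norm(actions) + 3.0)))

def gate_actions(state, visual_emb, actions):
    """Execute the VLA's action chunk only if predicted risk is low;
    otherwise signal a risk-guided recovery toward a safe state."""
    r_hat = estimate_risk(state, visual_emb, actions)
    if r_hat > RISK_THRESHOLD:
        return "recover", r_hat  # reject the risky command
    return "execute", r_hat

state = np.zeros(14)                   # e.g. two 7-DoF arms, joint positions
visual_emb = np.zeros(256)             # visual embedding from the VLA backbone
small_chunk = 0.01 * np.ones((5, 14))  # gentle 5-step action chunk
decision, r = gate_actions(state, visual_emb, small_chunk)
```

With the gentle chunk above the surrogate risk stays well below the threshold, so the command passes; a large, aggressive chunk would instead trigger the recovery branch.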

The risk estimator receives the current dual-arm state (joint positions, gripper states), a visual embedding extracted from the VLA backbone, and a candidate action sequence of length H (usually 5–10 steps) proposed by the VLA policy. A lightweight cross-attention architecture fuses the proprioceptive and visual streams, producing calibrated risk scores r̂ ∈ [0, 1] over the planned horizon.
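The fusion step can be sketched with a single cross-attention layer in which proprioceptive and planned-action tokens query the visual tokens, followed by a sigmoid head that maps the pooled features to a score in [0, 1]. Dimensions, weight initialization, and the pooling choice are assumptions for illustration; training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed dimensions, not from the paper:
D = 64       # shared embedding width
H = 5        # action horizon (paper reports 5-10 steps)
N_TOK = 16   # number of visual tokens from the VLA backbone

# Randomly initialized projections (a trained model would learn these).
Wq, Wk, Wv = (rng.normal(0, 0.1, (D, D)) for _ in range(3))
w_head = rng.normal(0, 0.1, D)

def risk_head(proprio_tokens, visual_tokens):
    """One cross-attention layer: proprioceptive/action tokens attend to
    visual tokens; a sigmoid head emits a risk score in [0, 1]."""
    Q = proprio_tokens @ Wq                 # (H+1, D)
    K = visual_tokens @ Wk                  # (N_TOK, D)
    V = visual_tokens @ Wv                  # (N_TOK, D)
    attn = softmax(Q @ K.T / np.sqrt(D))    # (H+1, N_TOK)
    fused = attn @ V                        # (H+1, D)
    pooled = fused.mean(axis=0)             # pool over tokens
    return float(sigmoid(pooled @ w_head))

proprio = rng.normal(size=(H + 1, D))  # one state token + H action tokens
visual = rng.normal(size=(N_TOK, D))
r_hat = risk_head(proprio, visual)
```

The sigmoid output gives a probability-like score; in the paper this score is additionally calibrated via post-training on real-robot rollouts.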

