EquiContact: A Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-rich Tasks


This paper presents a framework for learning vision-based robotic policies for contact-rich manipulation tasks that generalize spatially across task configurations. We focus on achieving robust spatial generalization of the policy for the peg-in-hole (PiH) task trained from a small number of demonstrations. We propose EquiContact, a hierarchical policy composed of a high-level vision planner (Diffusion Equivariant Descriptor Field, Diff-EDF) and a novel low-level compliant visuomotor policy (Geometric Compliant ACT, G-CompACT). G-CompACT operates using only localized observations (geometrically consistent error vectors (GCEV), force-torque readings, and wrist-mounted RGB images) and produces actions defined in the end-effector frame. Through these design choices, we show that the entire EquiContact pipeline is SE(3)-equivariant, from perception to force control. We also outline three key components for spatially generalizable contact-rich policies: compliance, localized policies, and induced equivariance. Real-world experiments on PiH, screwing, and surface wiping tasks demonstrate a near-perfect success rate and robust generalization to unseen spatial configurations, validating the proposed framework and principles. The experimental videos and more details can be found on the project website: https://equicontact.github.io/EquiContact-website/


💡 Research Summary

EquiContact introduces a hierarchical framework that achieves spatial generalization for contact‑rich manipulation by enforcing SE(3) equivariance from perception to force control. The system consists of two main components: a high‑level vision planner called Diffusion Equivariant Descriptor Field (Diff‑EDF) and a low‑level compliant visuomotor policy named Geometric Compliant Action Chunking Transformer (G‑CompACT). Diff‑EDF processes point‑cloud data from external cameras and predicts a reference frame for the target object (e.g., a hole) in the world coordinate system. This reference frame is then used to anchor the low‑level policy, which operates solely on local observations: a geometrically consistent error vector (GCEV) that encodes the pose error between the current end‑effector and the reference frame, force‑torque sensor readings expressed in the end‑effector frame, and RGB images from wrist‑mounted cameras.
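The key property of the GCEV is that it is unchanged when the whole scene is moved by a rigid transform, which is what lets the low-level policy transfer to new spatial configurations. A minimal sketch of this idea (the paper's exact GCEV parameterization may differ; the rotation-vector encoding here is an assumption):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose(rotvec, t):
    """Build a 4x4 homogeneous transform from a rotation vector and a translation."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(rotvec).as_matrix()
    T[:3, 3] = t
    return T

def gcev(T_ee, T_ref):
    """Pose error of the reference frame relative to the end-effector,
    expressed in the end-effector frame (illustrative sketch)."""
    T_err = np.linalg.inv(T_ee) @ T_ref                      # relative transform
    p_err = T_err[:3, 3]                                     # translational error
    r_err = Rotation.from_matrix(T_err[:3, :3]).as_rotvec()  # rotational error (log map)
    return np.concatenate([p_err, r_err])

# Left-invariance: moving the whole scene by g leaves the GCEV unchanged,
# because inv(g @ T_ee) @ (g @ T_ref) = inv(T_ee) @ T_ref.
T_ee  = pose([0.1, 0.0, 0.3], [0.4, 0.0, 0.2])
T_ref = pose([0.0, 0.0, 0.5], [0.5, 0.1, 0.0])
g     = pose([0.0, 0.0, np.pi / 2], [1.0, -0.5, 0.2])
assert np.allclose(gcev(g @ T_ee, g @ T_ref), gcev(T_ee, T_ref))
```

A policy conditioned on this vector therefore sees identical inputs before and after any SE(3) relocation of the task, provided the high-level planner relocates the reference frame accordingly.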

G‑CompACT extends the Action Chunking Transformer (ACT) architecture. Visual features are extracted with a CLIP‑ResNet50 backbone, modulated by task‑description text via FiLM layers, and combined with the GCEV and force‑torque inputs in a transformer decoder. The decoder outputs a relative pose command (g_rel) and admittance gains (K_p, K_R). These are fed to a Geometric Admittance Controller (GAC) that generates compliant motion commands directly in the end‑effector frame. Because both the observation (GCEV, forces, images) and the action (relative pose, admittance gains) are defined in the same local frame, the entire pipeline is left‑invariant: applying any SE(3) transformation to the world results in an equivalent transformation of the internal representations, preserving behavior without retraining.
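The same locality argument applies to the controller: because the wrench input and the pose correction are both expressed in the end-effector frame, the control law needs no world-frame information. A hypothetical, heavily simplified admittance step in that spirit (the paper's Geometric Admittance Controller is a full geometric formulation with damping; the scalar gains and first-order update here are assumptions for illustration):

```python
import numpy as np

def admittance_step(g_rel, wrench_ee, Kp, KR, dt=0.01):
    """One step of a minimal admittance law in the end-effector frame.

    g_rel     : (6,) commanded relative pose [dp, dr] from the policy
    wrench_ee : (6,) force-torque reading [f, tau] in the EE frame
    Kp, KR    : translational / rotational stiffness gains (policy outputs)

    External force and torque shift the commanded relative pose, scaled
    inversely by stiffness: softer gains yield larger compliant motion.
    """
    f, tau = wrench_ee[:3], wrench_ee[3:]
    dp = g_rel[:3] + dt * f / Kp    # compliant translational correction
    dr = g_rel[3:] + dt * tau / KR  # compliant rotational correction
    return np.concatenate([dp, dr])

# Since every quantity lives in the EE frame, this map is untouched by
# any SE(3) transform applied to the world, i.e. it is left-invariant.
cmd = admittance_step(np.zeros(6),
                      np.array([2.0, 0.0, 0.0, 0.0, 0.0, 0.5]),
                      Kp=200.0, KR=50.0)
```

With zero wrench the step passes the policy command through unchanged; any contact wrench perturbs it toward compliance, which is what corrects residual alignment errors during insertion.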

The authors formalize three design principles that guarantee spatial generalization: (1) left‑invariant compliant control, (2) a localized policy that only depends on end‑effector‑centric data, and (3) induced equivariance through the high‑level planner’s reference frame. They provide mathematical proofs that these conditions are sufficient for SE(3) vision‑to‑force equivariance, even though the neural networks themselves are not explicitly equivariant.
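Paraphrasing the claim in compact form (the symbols here are illustrative, not the paper's exact notation): if $\pi$ denotes the full pipeline from scene observation $o$ to world-frame behavior, the guaranteed property is

```latex
\pi(g \cdot o) \;=\; g \cdot \pi(o), \qquad \forall\, g \in \mathrm{SE}(3),
```

which follows because the planner's reference frame transforms equivariantly, $T_{\mathrm{ref}} \mapsto g\,T_{\mathrm{ref}}$, while the localized observations and actions are left-invariant, e.g. $(g\,T_{ee})^{-1}(g\,T_{\mathrm{ref}}) = T_{ee}^{-1}\,T_{\mathrm{ref}}$. The low-level network itself can be an ordinary, non-equivariant model; equivariance is induced by the choice of frames rather than by the architecture.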

Experiments were conducted on three contact‑rich tasks: peg‑in‑hole (PiH), screw insertion, and surface wiping. For PiH, which requires sub‑millimeter accuracy, Diff‑EDF alone could not achieve the needed precision, but G‑CompACT's real‑time visual feedback and force‑based compliance corrected residual errors, yielding near‑perfect success (>98%). The policies were trained on only 20–30 demonstrations collected in a fixed configuration, yet they generalized to unseen configurations involving arbitrary translations and rotations (including 45°, 90°, and 135°) with only minor performance loss. Similar success rates (≈95% for screwing, ≈93% for wiping) were observed when the same framework was applied without any task‑specific redesign.

Key advantages highlighted include sample efficiency (few demonstrations suffice), real‑time feasibility (the equivariant structure avoids costly group‑convolution operations), and extensibility to other contact‑rich tasks. Limitations involve dependence on the accuracy of the high‑level planner’s reference pose; large errors there increase the burden on the low‑level controller. Additionally, the current implementation assumes reliable wrist‑camera visibility and does not handle large occlusions or rapid changes in object pose during execution.

Future work is suggested to incorporate multi‑camera fusion, dynamic object tracking, and to test the approach on more complex manipulators and multi‑step assembly pipelines. In summary, EquiContact demonstrates that enforcing SE(3) equivariance through structured perception‑to‑control pipelines enables robust, data‑efficient, and spatially generalizable policies for precise contact‑rich manipulation, offering a practical alternative to large‑scale imitation‑learning approaches.

