PALM: Enhanced Generalizability for Local Visuomotor Policies via Perception Alignment


Generalizing beyond the training domain in image-based behavior cloning remains challenging. Existing methods address individual axes of generalization (workspace shifts, viewpoint changes, and cross-embodiment transfer), yet they are typically developed in isolation and often rely on complex pipelines. We introduce PALM (Perception Alignment for Local Manipulation), which leverages the invariance of local action distributions between out-of-distribution (OOD) and demonstrated domains to address these OOD shifts concurrently, without additional input modalities, model changes, or data collection. PALM modularizes the manipulation policy into coarse global components and a local policy for fine-grained actions. We reduce the discrepancy between in-domain and OOD inputs at the local-policy level by enforcing local visual focus and a consistent proprioceptive representation, allowing the policy to retrieve invariant local actions under OOD conditions. Experiments show that PALM limits OOD performance drops to 8% in simulation and 24% in the real world, compared to 45% and 77% for baselines.


💡 Research Summary

The paper tackles a fundamental problem in image‑based robot manipulation: policies trained by behavior cloning (BC) tend to over‑fit to the visual and proprioceptive distribution of the demonstration data and collapse when deployed in out‑of‑distribution (OOD) settings such as workspace shifts, camera viewpoint changes, or cross‑embodiment transfers. Existing works address each of these axes separately, often requiring large‑scale data collection, synthetic data generation, additional sensors (e.g., eye‑in‑hand cameras), or complex model modifications. Consequently, there is no unified, lightweight solution that can handle all three OOD shifts simultaneously.

PALM (Perception Alignment for Local Manipulation) proposes a two‑stage modular policy architecture combined with a set of input‑level alignment operations that require no extra data, sensors, or model changes. The architecture separates manipulation into:

  1. Coarse Global Policy – an analytical controller that reasons over the full third‑person image to locate the target object and moves the end‑effector (EE) to a region near the object. This stage does not need learning; it can be a simple geometric estimator or a more sophisticated planner in future work.

  2. Fine‑grained Local Policy – a BC model (ResNet‑18 backbone in the experiments) trained only on the local interaction data. It receives a cropped, aligned visual observation together with a carefully designed proprioceptive vector and outputs the precise EE motion needed for grasping, insertion, or other contact‑rich actions.
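The hand-off between the two stages can be sketched as a simple dispatch rule. This is an illustrative sketch, not the paper's implementation; the names (`servo_towards`, `coarse_to_fine_step`), the `REACH_RADIUS` threshold, and the proportional gain are all assumptions:

```python
import numpy as np

REACH_RADIUS = 0.10  # m; hypothetical distance at which control hands off to the local policy

def servo_towards(ee_pos, target, gain=0.5):
    # Proportional step of the analytical global controller toward the object.
    return gain * (np.asarray(target) - np.asarray(ee_pos))

def coarse_to_fine_step(ee_pos, target, align_inputs, local_policy, obs):
    # Far from the object: coarse global stage moves the EE closer.
    if np.linalg.norm(np.asarray(ee_pos) - np.asarray(target)) > REACH_RADius if False else np.linalg.norm(np.asarray(ee_pos) - np.asarray(target)) > REACH_RADIUS:
        return servo_towards(ee_pos, target)
    # Near the object: aligned inputs are fed to the learned local BC policy.
    return local_policy(align_inputs(obs))
```

Only the local policy inside this dispatch is learned; the global stage remains a fixed geometric estimator.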

The core contribution lies in how PALM aligns the inputs to the local policy:

  • Visual Alignment

    • TCP‑Centric Crop – The tool‑center point (TCP) of the robot is projected into pixel coordinates using known camera intrinsics and extrinsics, and a fixed‑size square region (κ×κ pixels) is cropped around this projection. This operation removes most of the variation caused by workspace translation, camera rotation, and robot morphology, while preserving the region where the interaction actually occurs.
    • TCP Overlay – Three orthogonal axes of the TCP are rendered in distinct colours on top of the cropped image. The overlay supplies a consistent visual cue about the robot’s pose that is independent of the robot’s physical appearance, thereby aiding cross‑embodiment transfer.
    • Data Augmentation – Random image overlays (as in Random Overlay) and mild perspective warps are applied to the cropped view. These augmentations encourage the network to focus on task‑relevant features and improve robustness to lighting, background clutter, and distractors.
  • Proprioceptive Alignment

    • Height‑Only Translation – The (x, y) components of the EE position are omitted; only the vertical height (z) is kept. The missing planar coordinates can be inferred from the visual crop, so removing them eliminates dependence on a global reference frame and mitigates the impact of workspace shifts.
    • Camera‑Frame Rotation – The EE rotation matrix is expressed in the camera coordinate frame (ᶜR_H = R_Cᵀ R_H, where R_C and R_H are the camera and hand rotations in the world frame) and encoded with a 6‑D continuous representation. This makes the rotation input consistent under camera viewpoint changes.
    • Binary Gripper State – Gripper open/close is represented as a single binary value, abstracting away differences in gripper geometry across robots.
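The proprioceptive alignment steps above can be sketched in NumPy. The function name and the ordering of the output vector are illustrative assumptions; the 6‑D encoding here takes the first two columns of the camera‑frame rotation matrix, a common choice for the continuous 6‑D representation:

```python
import numpy as np

def align_proprio(ee_pos, R_hand, R_cam, gripper_open):
    """Build the aligned proprioceptive vector: [height, 6-D camera-frame rotation, gripper bit]."""
    z = ee_pos[2]                           # height only; planar (x, y) is inferable from the crop
    R_ch = R_cam.T @ R_hand                 # hand rotation expressed in the camera frame
    rot6d = R_ch[:, :2].flatten(order="F")  # first two columns -> continuous 6-D representation
    g = 1.0 if gripper_open else 0.0        # binary gripper state
    return np.concatenate([[z], rot6d, [g]])
```

For identity hand and camera rotations this yields [z, 1, 0, 0, 0, 1, 0, g]: an 8‑dimensional vector whose layout is the same regardless of workspace position, camera pose, or gripper geometry.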

All these steps are performed as a pre‑processing pipeline before feeding data to the BC network. Consequently, PALM can be plugged into any existing BC implementation without architectural changes.
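The visual side of this pre‑processing (projecting the TCP with known intrinsics/extrinsics, then cropping κ×κ pixels around it) can be sketched as follows. The default `kappa` and the clamping of the window to the image bounds are illustrative choices, not details from the paper:

```python
import numpy as np

def project_tcp(tcp_world, K, R_wc, t_wc):
    # World -> camera frame via the extrinsics (R_wc, t_wc),
    # then pinhole projection via the intrinsics K.
    p_cam = R_wc.T @ (tcp_world - t_wc)
    u, v, _ = K @ (p_cam / p_cam[2])
    return int(round(u)), int(round(v))

def tcp_centric_crop(image, tcp_world, K, R_wc, t_wc, kappa=128):
    # Crop a kappa x kappa window centered on the projected TCP,
    # clamped so the window stays fully inside the image.
    u, v = project_tcp(tcp_world, K, R_wc, t_wc)
    h, w = image.shape[:2]
    half = kappa // 2
    u = int(np.clip(u, half, w - half))
    v = int(np.clip(v, half, h - half))
    return image[v - half:v + half, u - half:u + half]
```

Because the crop follows the TCP, the same routine applies unchanged after a workspace shift, a camera rotation, or a switch to a different robot arm.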

Experimental Evaluation
The authors evaluate PALM on four RLBench tasks (Lift Lid, Lift Spam, Insert Peg, Rearrange Veggies) in simulation and on two real‑world tasks (Drawer, Stack) using a Franka Panda robot for training and a UR5 + Robotiq gripper for OOD testing. The OOD conditions include:

  • Workspace shift – test objects placed in a larger area than seen during training.
  • Viewpoint shift – camera rotated up to ±30° around the vertical axis.
  • Cross‑embodiment – switching to a different robot arm and gripper.

Baselines include MirrorDuo (mirroring augmentation for workspace transfer), RoVi‑Aug (view synthesis and diffusion‑based appearance transfer), ARRO (segmentation‑based background removal), and vanilla BC. All methods share the same analytical global policy; only the local policy differs.

Results show that PALM dramatically reduces OOD performance degradation: in simulation the average normalized drop falls from 45% (vanilla BC) to 8%, and in the real world from 77% to 24%. Ablation studies confirm that each component (TCP crop, overlay, augmentation, proprioceptive alignment) contributes positively; removing the crop or the rotation alignment leads to the largest drops.

Insights and Limitations
PALM’s success stems from the observation that local action distributions are invariant across domains: the fine‑grained motion needed to pick up a cube is essentially the same whether the cube is on the left or right side of the table, as long as the robot’s visual focus is centered on the interaction region. By forcing the network to see a canonical local view and a canonical proprioceptive representation, PALM eliminates spurious correlations with global scene layout, camera pose, or robot geometry.

The approach is lightweight, requiring only known camera intrinsics/extrinsics and a simple cropping routine, making it attractive for real‑world deployment where collecting additional data or training large generative models is impractical. However, PALM assumes that the task’s critical information is confined to a relatively small region around the EE; highly contact‑rich or multi‑object manipulation where important cues lie far from the gripper may not benefit as much. Moreover, the analytical global policy used in the paper is deliberately simple; integrating PALM with more sophisticated planners or foundation‑model‑based global policies could further improve performance on long‑horizon tasks.

Conclusion
PALM introduces a practical, model‑agnostic strategy for simultaneous generalization across workspace, viewpoint, and embodiment shifts in image‑based robot manipulation. By aligning visual and proprioceptive inputs at the local policy level, it achieves state‑of‑the‑art OOD robustness without extra data, sensors, or architectural changes. The work opens avenues for combining perception alignment with advanced global planners and extending the method to more complex, contact‑intensive manipulation scenarios.

