Bi-Adapt: Few-shot Bimanual Adaptation for Novel Categories of 3D Objects via Semantic Correspondence


Bimanual manipulation is imperative yet challenging for robots executing complex tasks, as it requires coordinated collaboration between two arms. However, existing methods for bimanual manipulation often rely on costly data collection and training, and struggle to generalize efficiently to unseen objects in novel categories. In this paper, we present Bi-Adapt, a novel framework designed for efficient generalization of bimanual manipulation via semantic correspondence. Bi-Adapt achieves cross-category affordance mapping by leveraging the strong capabilities of vision foundation models. After fine-tuning with limited data on novel categories, Bi-Adapt also exhibits notable zero-shot generalization to out-of-category objects. Extensive experiments conducted in both simulation and real-world environments validate the effectiveness of our approach and demonstrate its high efficiency, achieving a high success rate on benchmark tasks across novel categories with limited data. Project website: https://biadapt-project.github.io/


💡 Research Summary

The paper introduces Bi‑Adapt, a framework that enables robots to perform complex bimanual manipulation tasks on previously unseen object categories with only a few demonstrations. The core idea is to combine the semantic correspondence capabilities of large‑scale vision foundation models with a lightweight, two‑stage learning pipeline that first captures manipulation knowledge on a small “supporting set” of known objects and then transfers this knowledge to novel categories.

1. Supporting‑set learning.
The authors construct a supporting set consisting of several object categories (e.g., scissors, pliers) for which dense point‑level affordance data are collected. They design two coupled perception modules, M₁ (first gripper) and M₂ (second gripper). Each module contains an Action Proposal Network (A) that predicts a gripper orientation given a contact point, and an Action Scoring Network (C) that evaluates the suitability of the proposed action. To enforce collaboration, training proceeds in reverse order: M₂ is trained first to generate a viable second‑gripper action for a wide variety of first‑gripper inputs; subsequently M₁ is trained to propose first‑gripper contacts that facilitate the already‑learned M₂ behavior. This reversed data‑flow reduces the combinatorial explosion of joint actions while explicitly modeling inter‑hand dependency.
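The reversed training order can be illustrated with a minimal sketch. Here tiny linear models stand in for the paper's Action Proposal and Action Scoring networks, and the data, shapes, and function names are all illustrative assumptions, not the paper's implementation: M₂ is fit first to map (object features, first-gripper action) to a successful second-gripper action; the frozen M₂ is then used to score first-gripper candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical supporting-set data: object feature vectors, first-gripper
# actions, and the second-gripper actions that completed the task.
# Shapes and the linear "ground truth" are toy assumptions.
obj_feats = rng.normal(size=(64, 8))
first_actions = rng.normal(size=(64, 4))
second_actions = 0.5 * obj_feats[:, :4] + 0.3 * first_actions

def fit_linear(X, Y, lr=0.05, steps=500):
    """Plain gradient-descent linear regression, a stand-in for the
    neural Action Proposal (A) / Scoring (C) networks in the paper."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(steps):
        W -= lr * X.T @ (X @ W - Y) / len(X)
    return W

# Reversed order: train M2 (second gripper) first, conditioned on both
# the object and the first-gripper action.
X2 = np.hstack([obj_feats, first_actions])
W2 = fit_linear(X2, second_actions)

def predict_m2(f, a1):
    return np.hstack([f, a1]) @ W2

# Then M1 is trained (here: scored) to propose first-gripper actions
# that the frozen M2 can follow up on successfully.
def m1_score(f, a1, a2_target):
    return -np.linalg.norm(predict_m2(f, a1) - a2_target)
```

Conditioning M₁'s objective on the already-trained M₂ is what avoids searching the joint action space directly, which is the combinatorial saving the section describes.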

2. Semantic affordance transfer.
When a new object appears, the system retrieves successful contact‑point pairs from the supporting set for the same task. Using a vision foundation model (e.g., the diffusion‑based DIFT or the self‑supervised DINOv2), it extracts dense per‑pixel features from both the source image (with known contacts) and the target image (the novel object). By computing cosine similarity between source and target pixel features, the most semantically corresponding pixels are identified, and the associated 2‑D points are back‑projected into 3‑D using depth information. Multiple source examples generate a set of candidate contact‑point pairs on the novel object. Because the foundation model's predictions are not guaranteed to be physically feasible, these candidates must be filtered.
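The correspondence-and-back-projection step can be sketched as follows. The feature maps are assumed to come from a frozen foundation model such as DIFT or DINOv2; the function name, argument layout, and pinhole intrinsics `K` are illustrative assumptions rather than the paper's API:

```python
import numpy as np

def transfer_contact(src_feats, tgt_feats, src_px, tgt_depth, K):
    """Map a known contact pixel on a source object to the most
    semantically similar pixel on a novel target object, then lift it
    to 3-D with the target depth map.

    src_feats, tgt_feats: (H, W, C) per-pixel feature maps
    src_px: (u, v) contact pixel on the source image
    tgt_depth: (H, W) depth map in metres; K: 3x3 camera intrinsics
    """
    q = src_feats[src_px[1], src_px[0]]
    q = q / np.linalg.norm(q)                            # query feature
    t = tgt_feats / np.linalg.norm(tgt_feats, axis=-1, keepdims=True)
    sim = t @ q                                          # cosine-similarity map
    v, u = np.unravel_index(np.argmax(sim), sim.shape)   # best-matching pixel
    z = tgt_depth[v, u]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # standard pinhole back-projection of (u, v, z) into camera frame
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
```

Running this once per retrieved source example yields the set of candidate 3-D contact points that the later filtering and adaptation stages operate on.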

3. Few‑shot adaptation.
The authors introduce a fast adaptation stage that fine‑tunes the perception modules using a handful of real robot interactions on the novel category (typically fewer than 20 trials). The pre‑trained modules propose actions for each candidate pair; the pair with the highest predicted success likelihood is executed, and the observed outcome (success or failure) is used to update the A and C networks via gradient descent. This loop simultaneously refines contact‑point selection and orientation prediction, allowing the system to overcome errors introduced by the semantic mapping and to adapt to variations in geometry, material, and friction.
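The propose-execute-update loop can be sketched with a linear-logistic scorer standing in for the Action Scoring Network C. The `execute` callback abstracts a real or simulated rollout returning success (1.0) or failure (0.0); all names and the choice of a logistic update are illustrative assumptions:

```python
import numpy as np

def adapt_few_shot(candidates, featurize, execute, w, lr=0.1, trials=20):
    """Few-shot adaptation sketch: score candidate contact-point pairs,
    execute the top-ranked one, and update the scorer from the observed
    binary outcome (a toy proxy for fine-tuning A and C by gradient descent)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(trials):
        feats = np.array([featurize(c) for c in candidates])
        scores = sigmoid(feats @ w)
        best = int(np.argmax(scores))          # most promising candidate pair
        success = execute(candidates[best])    # observed outcome: 1.0 or 0.0
        # logistic-regression gradient step on the executed sample only
        w = w + lr * (success - scores[best]) * feats[best]
    return w
```

Even this toy loop shows the key property of the stage: the scorer improves from its own executed choices, so roughly twenty trials can correct candidates that the semantic mapping ranked wrongly.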

4. Experimental validation.
Experiments are conducted in both simulation and on a real dual‑arm robot across five benchmark tasks: unfolding, opening, closing (rotational joints), uncapping, and capping (prismatic joints). Success criteria are defined in terms of joint angle change or part separation distance. Using only three to four categories in the supporting set, Bi‑Adapt achieves an average success rate above 85 % on twenty unseen objects spanning novel categories. Ablation studies show that using only the foundation‑model affordance (without few‑shot fine‑tuning) yields roughly 60 % success, while the full pipeline reaches the reported performance, demonstrating the critical role of the adaptation stage.
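The success criteria amount to simple threshold checks. A minimal sketch, with threshold values chosen for illustration (the paper's exact thresholds are not reproduced here):

```python
def task_success(joint_delta_rad=None, separation_m=None,
                 angle_thresh=0.5, dist_thresh=0.05):
    """Toy success check mirroring the two criteria: rotational tasks
    (unfolding, opening, closing) succeed when the joint angle changes
    enough; prismatic tasks (uncapping, capping) succeed when the parts
    separate far enough. Thresholds are illustrative, not the paper's."""
    if joint_delta_rad is not None:
        return abs(joint_delta_rad) >= angle_thresh
    if separation_m is not None:
        return separation_m >= dist_thresh
    return False
```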

5. Limitations and future work.
The approach relies heavily on the quality of the input images and the viewpoint; poor lighting or occlusions can degrade semantic correspondence. The current design assumes symmetric bimanual actions (both hands perform similar primitives), which limits applicability to asymmetric tasks where one hand stabilizes while the other manipulates. Future directions include multi‑view feature fusion, handling dynamic or deformable objects, and extending the policy to heterogeneous hand roles.

In summary, Bi‑Adapt provides a practical solution to the data‑efficiency bottleneck in bimanual manipulation by leveraging foundation‑model semantics for cross‑category affordance transfer and a lightweight few‑shot fine‑tuning loop. It demonstrates that high‑level visual knowledge can be effectively grounded in precise robot actions, opening avenues for scalable, open‑world dual‑arm manipulation.

