DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos
Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data, without any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) performing subtask decomposition and bimanual scheduling; (3) synthesizing robot trajectories consistent with the demonstrated interactions; (4) performing comprehensive data augmentation for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data based on human videos, either from the Internet or from video generation models. DexImit is capable of handling diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulations (e.g., stacking cups).
💡 Research Summary
DexImit tackles the chronic data scarcity problem in bimanual dexterous manipulation by automatically converting monocular human manipulation videos into physically plausible robot demonstrations. The framework consists of four tightly coupled stages that together bridge the embodiment gap without requiring any depth sensors, camera intrinsics, or manual annotation.
Stage 1 – Depth‑free 4D reconstruction.
Given an RGB video captured from an arbitrary viewpoint, DexImit first employs a vision‑language model (Qwen3‑VL) to identify all objects involved in the task. Frame‑wise semantic segmentation (Grounded‑SAM2) yields masks for the two hands, the manipulated objects, and the supporting table. A monocular depth estimator (SpatialTracker v2) provides an unscaled depth map, which is then metrically calibrated using the known size distribution of human hands. Hand meshes are recovered with the WILOR model, while object meshes are generated by SAM3D. An iterative align‑render‑align procedure aligns the meshes to the depth map, producing near‑metric scale hand and object point clouds. 6‑DoF poses for both hands and objects are refined with a tracking variant of FoundationPose, ensuring temporal continuity. Finally, a world coordinate system is defined by extracting the table normal (world‑z) and the bisector of the initial hand positions (world‑x), allowing all trajectories to be expressed in a common frame.
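The final step of Stage 1, building a world frame from the table normal and the bisector of the initial hand positions, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's code: the function name and inputs are hypothetical, the table normal is taken as the smallest singular vector of the centered table point cloud, and the "bisector" is interpreted as the in-plane direction from the table centroid toward the midpoint of the two hands.

```python
import numpy as np

def build_world_frame(table_pts, left_hand_pos, right_hand_pos):
    """Sketch: derive a world frame from a table point cloud (N x 3)
    and the two initial hand positions (3-vectors), all hypothetical
    inputs expressed in camera coordinates."""
    # world-z: table normal via least-squares plane fit
    # (smallest right singular vector of the centered points)
    centroid = table_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(table_pts - centroid)
    z = vt[-1]
    if z[2] < 0:                      # orient the normal consistently
        z = -z
    # world-x: direction toward the midpoint of the hands,
    # projected into the table plane (assumed interpretation of "bisector")
    mid = 0.5 * (left_hand_pos + right_hand_pos)
    x = mid - centroid
    x -= z * np.dot(x, z)             # remove the out-of-plane component
    x /= np.linalg.norm(x)
    y = np.cross(z, x)                # complete a right-handed frame
    R = np.stack([x, y, z], axis=1)   # columns are world axes in camera coords
    return R, centroid
```

All reconstructed hand and object trajectories can then be mapped into this common frame by applying the inverse transform, which is what allows demonstrations from arbitrary viewpoints to be compared and aggregated.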
Stage 2 – Subtask decomposition and Action‑Centric scheduling.
The reconstructed trajectories are fed back to Qwen3‑VL for automatic task parsing. Each manipulation is represented as a sequence of sub‑actions of four primitive types (pre‑grasp, grasp, motion, release) together with the involved embodiment (left hand, right hand, or both). DexImit introduces an Action‑Centric Scheduling algorithm that maintains a priority queue of active tasks and dynamically allocates sub‑actions to the two robotic hands. The scheduler respects arbitrary horizons, varying degrees of concurrency, and asynchronous execution, thereby supporting simple unimanual grasps, coordinated bimanual grasps, and fully concurrent actions such as pouring.
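The scheduling idea can be illustrated with a small priority-queue dispatcher. This is a simplified sketch, not the paper's Action-Centric Scheduling algorithm: sub-actions here are hypothetical `(priority, name, hands, duration)` tuples, and concurrency is modeled simply by tracking when each hand next becomes free, so unimanual sub-actions on different hands overlap while bimanual ones wait for both hands.

```python
import heapq

def schedule(subactions):
    """Greedy sketch: pop sub-actions in priority order and start each
    one at the earliest time all of its required hands are free.
    subactions: iterable of (priority, name, hands, duration)."""
    free_at = {"left": 0.0, "right": 0.0}   # next time each hand is free
    heap = [(p, i, name, hands, dur)        # i breaks ties deterministically
            for i, (p, name, hands, dur) in enumerate(subactions)]
    heapq.heapify(heap)
    timeline = []
    while heap:
        _, _, name, hands, dur = heapq.heappop(heap)
        start = max(free_at[h] for h in hands)
        end = start + dur
        for h in hands:
            free_at[h] = end
        timeline.append((name, start, end))
    return timeline
```

With two unimanual grasps followed by a bimanual pour, the grasps execute concurrently on separate hands and the pour starts only once both hands are released, mirroring the asynchronous execution the scheduler is designed to support.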
Stage 3 – Structured action generation.
With the scheduled sub‑actions, DexImit synthesizes grasps that satisfy force‑closure constraints. Hand‑object contact is modeled using MANO‑prompts combined with object physical properties, ensuring that the generated grasps are stable under realistic contact forces. A key‑frame‑based motion planner then computes collision‑free joint trajectories for the 24‑DoF dexterous hands, respecting joint limits and dynamic feasibility. The resulting robot trajectories are verified in simulation before being exported as training data.
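The key-frame planning step can be sketched in joint space. This is a deliberately minimal stand-in for the planner described above: it only linearly interpolates between key configurations and clamps to joint limits, whereas the actual planner also enforces collision-freeness and dynamic feasibility. The function name and array layout are assumptions.

```python
import numpy as np

def keyframe_trajectory(keyframes, joint_limits, steps_per_segment=20):
    """Sketch of key-frame-based joint trajectory generation.
    keyframes:    (K, D) array of joint configurations for a D-DoF hand.
    joint_limits: (D, 2) array of [lower, upper] bounds per joint."""
    lo, hi = joint_limits[:, 0], joint_limits[:, 1]
    traj = []
    for q0, q1 in zip(keyframes[:-1], keyframes[1:]):
        # densify each segment by linear interpolation (no collision
        # checking here; the real planner would reject invalid waypoints)
        for s in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            traj.append(np.clip((1 - s) * q0 + s * q1, lo, hi))
    traj.append(np.clip(keyframes[-1], lo, hi))
    return np.array(traj)
```

For the 24-DoF hands used here, `D = 24` and the key frames would come from the synthesized grasp and release configurations of each scheduled sub-action.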
Stage 4 – Comprehensive data augmentation.
To close the sim‑to‑real gap, DexImit augments the generated dataset by randomizing object poses and scales, camera extrinsics, lighting, and visual textures. This augmentation dramatically increases the diversity of observations and forces the downstream policy to learn robust, domain‑invariant representations.
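A typical domain-randomization pass over the listed factors might look like the sketch below. The field names, ranges, and scene representation are all hypothetical; the paper does not specify its randomization distributions, so uniform perturbations are assumed purely for illustration.

```python
import random

def randomize_scene(base_scene, rng=None):
    """Sketch: perturb object pose/scale, camera extrinsics, lighting,
    and texture for one augmented episode (all ranges are assumptions)."""
    rng = rng or random.Random()
    scene = dict(base_scene)
    scene["object_xy"] = [c + rng.uniform(-0.05, 0.05)          # meters
                          for c in base_scene["object_xy"]]
    scene["object_scale"] = base_scene["object_scale"] * rng.uniform(0.9, 1.1)
    scene["camera_yaw_deg"] = base_scene["camera_yaw_deg"] + rng.uniform(-10, 10)
    scene["light_intensity"] = base_scene["light_intensity"] * rng.uniform(0.5, 1.5)
    scene["texture_id"] = rng.randrange(base_scene["num_textures"])
    return scene
```

Sampling many such scenes per source trajectory is what multiplies the effective dataset size and decorrelates the policy's visual features from any single capture condition.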
Experiments and results.
DexImit was evaluated on a suite of ten manipulation domains ranging from tool use (cutting an apple) and long‑horizon tasks (making a beverage) to fine‑grained stacking. Training policies on the synthetic DexImit dataset enabled zero‑shot transfer to a real dual‑hand robot platform, achieving success rates above 70% across all tasks. Ablation studies confirmed that each component (metric‑scale reconstruction, Action‑Centric scheduling, and force‑closure grasp synthesis) contributes significantly to overall performance. Compared with prior human‑video‑to‑robot pipelines that rely on depth sensors or precise camera calibration, DexImit reduces the data collection burden dramatically while expanding the range of feasible tasks.
Significance and limitations.
DexImit demonstrates that large‑scale, in‑the‑wild human videos can be turned into high‑fidelity robot data without any additional hardware. Its modular pipeline is compatible with both real internet videos and synthetic videos generated by text‑to‑video models, opening a path toward virtually unlimited training data for dexterous manipulation. However, the approach still depends on the accuracy of off‑the‑shelf segmentation, depth estimation, and pose‑estimation networks; extreme occlusions, fast motions, or highly reflective objects can degrade reconstruction quality. Moreover, force‑closure grasp synthesis assumes known object friction and geometry, which may limit generalization to novel objects with unknown physical parameters. Future work could integrate self‑supervised refinement loops and online domain adaptation to further close the reality gap.