EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

Reading time: 4 minutes

📝 Original Info

  • Title: EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos
  • ArXiv ID: 2601.01050
  • Date: 2026-01-03
  • Authors: Hongming Fu, Wenjia Wang, Xiaozhen Qiao, Shuo Yang, Zheng Liu, Bo Zhao

📝 Abstract

We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints. Their performance also suffers under severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework with a robust preprocessing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scalable to multiple objects. Experiments show that our method achieves state-of-the-art performance in W-HOI reconstruction.

💡 Deep Analysis

📄 Full Content

Understanding HOI from egocentric videos is a fundamental problem in computer vision and embodied intelligence.

Reconstructing accurate world-space HOI meshes, capturing both spatial geometry and temporal dynamics, is crucial for analyzing human manipulation behavior and enabling downstream applications in embodied AI, robotics, and virtual/augmented reality. Compared to third-person observation, egocentric videos provide richer cues about how humans perceive and act on objects from their own perspective.

However, these videos are typically recorded by dynamic cameras in highly unconstrained environments, where frequent occlusions, motion blur, and complex hand-object motion make robust 3D reconstruction extremely challenging. To fully interpret and model human actions, one must recover temporally coherent trajectories of both hands and objects in world coordinates, beyond per-frame geometry in camera coordinates.

Despite rapid progress in 3D hand and HOI reconstruction, existing methods remain limited when applied to egocentric settings. Most approaches operate at the image or short-sequence level, estimating 3D hand poses [21,22] and object poses [2,13] frame by frame without enforcing long-term temporal consistency. Moreover, almost all prior HOI and object 6DoF estimation frameworks predict results in camera coordinates [2,7,13,34,36,37], which change dynamically as the wearer moves, making it impossible to obtain consistent global trajectories over time. Some recent works [36,37] incorporate differentiable rendering to improve spatial alignment, but these methods are often sensitive to noise and unstable in highly dynamic real-world conditions. Additionally, while egocentric videos inherently encode structural cues between the camera, body, and hands, existing approaches rarely exploit such coupling priors to stabilize motion estimation.

Reconstructing in-the-wild world-space hand-object interactions remains highly challenging. The entanglement of camera and local hand/object motion complicates global trajectory recovery and hinders world-aligned estimation. Real-world scenarios involve unknown objects, demanding template-free reconstruction that generalizes across categories, shapes, and quantities. Robust estimation under occlusion and motion blur is difficult for methods relying on per-frame recognition or differentiable rendering. Furthermore, maintaining spatial-temporal coherence over long egocentric sequences while preventing drift and ensuring plausibility remains an open challenge.

To address these challenges, we propose EgoGrasp, to our knowledge, the first method that reconstructs world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras. EgoGrasp adopts a multi-stage “perception-generation-optimization” framework that leverages reliable 3D cues from modern perception systems while introducing a generative motion prior to ensure temporal and global consistency.

EgoGrasp operates in three stages:

(1) Preprocessing: We recover accurate camera trajectories and dense geometry from egocentric videos, establishing consistent world coordinates. Initial 3D hand poses and object 6DoF poses are extracted and aligned, providing robust spatial grounding and temporal initialization.

(2) Motion Diffusion: A two-stage decoupled diffusion model generates coherent hand-object motion. The first stage produces temporally stable hand trajectories guided by SMPL-X [20] whole-body poses, mitigating egocentric viewpoint shifts and self-occlusions. The second stage refines hand-object interactions without CAD models, capturing natural dynamics and reducing world drift.

(3) Test-time Optimization: A differentiable refinement optimizes SMPL-X parameters to improve spatial accuracy, temporal smoothness, and foot-ground contact consistency. The body is reconstructed only as a structural prior to ensure realistic hand-body coordination, yielding globally consistent trajectories.
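For readers who want a concrete picture of stage (3), the snippet below is a minimal, hypothetical sketch of such a multi-objective test-time refinement in PyTorch. The helper names (forward_smplx, cam_proj), the specific loss terms (2D reprojection, acceleration-based smoothness, and a crude foot-ground proxy with the ground assumed at z = 0), and the weights are illustrative assumptions rather than the paper's actual objectives.

```python
# Minimal, hypothetical sketch of a multi-objective test-time refinement.
# The helpers `forward_smplx` and `cam_proj`, the loss forms, and the weights
# are illustrative assumptions, not the paper's actual implementation.
import torch


def temporal_smoothness(x):
    """Penalize frame-to-frame acceleration of a per-frame quantity x: (T, ...)."""
    vel = x[1:] - x[:-1]
    acc = vel[1:] - vel[:-1]
    return (acc ** 2).mean()


def refine_sequence(smplx_params, forward_smplx, cam_proj, keypoints_2d,
                    foot_idx, num_iters=200, lr=1e-2,
                    w_data=1.0, w_smooth=0.1, w_contact=0.05):
    """Jointly minimize reprojection, smoothness, and a foot-ground proxy term.

    smplx_params : dict of per-sequence SMPL-X tensors (pose, betas, transl, ...)
    forward_smplx: callable mapping the parameters to world-space joints (T, J, 3)
    cam_proj     : callable projecting 3D joints to image space (T, J, 2)
    keypoints_2d : observed 2D keypoints (T, J, 2) used as the data term
    foot_idx     : indices of foot joints, used for the ground-contact proxy
    """
    params = {k: v.clone().requires_grad_(True) for k, v in smplx_params.items()}
    opt = torch.optim.Adam(list(params.values()), lr=lr)

    for _ in range(num_iters):
        opt.zero_grad()
        joints_3d = forward_smplx(**params)                 # (T, J, 3) world space
        joints_2d = cam_proj(joints_3d)                     # (T, J, 2) image space

        loss_data = ((joints_2d - keypoints_2d) ** 2).mean()      # 2D reprojection
        loss_smooth = temporal_smoothness(joints_3d)               # temporal coherence
        # Crude contact proxy: the lowest foot joint should stay near the ground
        # plane, assumed here to be z = 0 in world coordinates.
        lowest_foot = joints_3d[:, foot_idx, 2].min(dim=1).values  # (T,)
        loss_contact = lowest_foot.abs().mean()

        loss = w_data * loss_data + w_smooth * loss_smooth + w_contact * loss_contact
        loss.backward()
        opt.step()

    return {k: v.detach() for k, v in params.items()}
```

In the actual method, the data terms would come from the stage (1) hand and object estimates rather than generic 2D keypoints, and the contact term would typically apply only on frames where foot-ground contact is detected.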

We validate EgoGrasp on the H2O and HOI4D datasets, achieving state-of-the-art results in world-space hand estimation and HOI reconstruction, with strong global trajectory consistency, demonstrating robustness to dynamic camera motion and in-the-wild conditions.

Our key contributions are summarized as follows:

  • Motivated by the requirements of embodied AI, we present a comprehensive analysis of the limitations inherent in current hand pose estimation, hand-object interaction modeling, and object 6DoF tracking approaches. Building upon these insights, we introduce the task of world-space hand-object interaction (W-HOI).
  • We further propose a novel framework for W-HOI reconstruction from egocentric monocular videos captured by dynamic cameras. Our approach produces consistent world-space HOI trajectories, while remaining template-free and scalable to arbitrary numbers of objects.
  • Extensive experiments demonstrate that EgoGrasp substantially outperforms prior methods in world-space HOI reconstruction.

Reference

This content is AI-processed based on open access ArXiv data.
