JRDB-Pose3D: A Multi-person 3D Human Pose and Shape Estimation Dataset for Robotics
Real-world scenes are inherently crowded. Hence, estimating 3D poses of all nearby humans, tracking their movements over time, and understanding their activities within social and environmental contexts are essential for many applications, such as autonomous driving, robot perception, robot navigation, and human-robot interaction. However, most existing 3D human pose estimation datasets primarily focus on single-person scenes or are collected in controlled laboratory environments, which restricts their relevance to real-world applications. To bridge this gap, we introduce JRDB-Pose3D, which captures multi-human indoor and outdoor environments from a mobile robotic platform. JRDB-Pose3D provides rich 3D human pose annotations for such complex and dynamic scenes, including SMPL-based pose annotations with consistent body-shape parameters and track IDs for each individual over time. JRDB-Pose3D contains, on average, 5-10 human poses per frame, with some scenes featuring up to 35 individuals simultaneously. The proposed dataset presents unique challenges, including frequent occlusions, truncated bodies, and out-of-frame body parts, which closely reflect real-world environments. Moreover, JRDB-Pose3D inherits all available annotations from the JRDB dataset, such as 2D pose, information about social grouping, activities, and interactions, full-scene semantic masks with consistent human- and object-level tracking, and detailed annotations for each individual, such as age, gender, and race, making it a holistic dataset for a wide range of downstream perception and human-centric understanding tasks.
💡 Research Summary
JRDB‑Pose3D is a large‑scale, real‑world dataset designed for multi‑person 3D human pose and shape estimation from a robot‑mounted viewpoint. Existing 3D pose datasets largely focus on single subjects, controlled laboratory settings, or static cameras, which limits their relevance for autonomous robots operating in crowded, dynamic environments. To fill this gap, the authors built upon the JRDB dataset—originally collected by the JackRabbot mobile robot navigating a university campus—and added SMPL‑based 3D pose, consistent body‑shape parameters, and per‑person track IDs for every frame. The dataset comprises 54 sequences (288,435 frames) captured by five pairs of stereo RGB cameras (providing a 360° panoramic view) and a lidar sensor. On average each frame contains 5–10 people, with peaks of up to 35 individuals, yielding roughly 0.6 M pose annotations (≈40 K unique pose samples).
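The per‑person annotations described above (SMPL pose, a shape vector held consistent within each track, a persistent track ID, and per‑keypoint occlusion labels) could be represented per frame roughly as follows. This is an illustrative sketch only: the class name, field names, and record layout are assumptions, not the dataset's actual file format. The array sizes follow the standard SMPL convention (10 shape coefficients; 24 joints in axis‑angle form).

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PersonAnnotation:
    """One person in one frame (hypothetical record layout)."""
    track_id: int          # persistent identity across the sequence
    betas: np.ndarray      # (10,) SMPL shape coefficients, shared per track
    body_pose: np.ndarray  # (72,) SMPL axis-angle pose: 24 joints x 3
    transl: np.ndarray     # (3,) global translation in the camera frame
    occluded: np.ndarray   # (24,) per-keypoint occlusion flags


# A plausible frame would be a list of such records, one per visible person.
person = PersonAnnotation(
    track_id=7,
    betas=np.zeros(10),
    body_pose=np.zeros(72),
    transl=np.array([0.5, 0.0, 3.0]),
    occluded=np.zeros(24, dtype=bool),
)
```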
The annotation pipeline consists of five stages: (1) initialization of 3D SMPL poses using CameraHMR, a state‑of‑the‑art multi‑human pose estimator that also predicts camera field‑of‑view; (2) rigid alignment of the initial 3D joints to the JRDB camera coordinate system via a PnP solution that leverages the precise intrinsic parameters and the existing 2D keypoints; (3) selection of a representative shape (β) for each tracked individual, followed by manual refinement to ensure shape consistency across the entire trajectory; (4) refinement of local pose parameters by minimizing reprojection error to the 2D keypoints while adding a second‑order temporal smoothness term that encourages realistic motion dynamics; and (5) thorough manual inspection and correction, especially for heavily occluded or out‑of‑frame body parts. This multi‑stage approach yields high‑quality, temporally coherent annotations with per‑keypoint occlusion labels.
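Stage (4) above combines a 2D reprojection term with a second‑order temporal smoothness term. A minimal sketch of such an objective is shown below, assuming a pinhole camera with intrinsics `K`, per‑keypoint visibility masks, and a hypothetical smoothness weight `lam`; the authors' actual formulation, weights, and optimization variables are not given in the summary.

```python
import numpy as np


def project(points_3d, K):
    """Pinhole projection of (J, 3) camera-frame points to (J, 2) pixels."""
    uvw = points_3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def refinement_loss(joints_3d, keypoints_2d, visible, K, lam=1.0):
    """Reprojection error on visible keypoints plus a second-order
    (acceleration) smoothness penalty over the trajectory.

    joints_3d:    (T, J, 3) 3D joints per frame, in camera coordinates
    keypoints_2d: (T, J, 2) annotated 2D keypoints
    visible:      (T, J)    1 where the keypoint is visible, 0 if occluded
    """
    reproj = 0.0
    for t in range(joints_3d.shape[0]):
        pix = project(joints_3d[t], K)
        err = np.linalg.norm(pix - keypoints_2d[t], axis=-1)  # (J,) pixel errors
        reproj += float((visible[t] * err).sum())

    # Second-order finite difference approximates per-joint acceleration;
    # penalizing it discourages jittery, unrealistic motion.
    accel = joints_3d[2:] - 2.0 * joints_3d[1:-1] + joints_3d[:-2]
    smooth = float(np.square(accel).sum())

    return reproj + lam * smooth
```

A static, perfectly reprojected trajectory drives both terms to zero, which is the fixed point the refinement pulls annotations toward.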
Beyond 3D pose and shape, JRDB‑Pose3D inherits the full suite of JRDB annotations: 2D skeletons, 3D bounding boxes, dense semantic segmentation masks for the whole scene (including objects and background), long‑term tracking of all entities, and detailed demographic metadata (age, gender, race). Interaction labels capture human‑human and human‑environment contacts, while group‑level activity tags describe collective behaviors such as conversation or walking together. Consequently, the dataset supports a wide range of downstream tasks, including multi‑person 3D pose estimation, pose‑aware activity recognition, interaction‑aware tracking, and physics‑based reasoning about human‑scene contact.
A comparative table positions JRDB‑Pose3D against more than 20 prior datasets. It is unique in combining a moving, ground‑level robot viewpoint, both indoor and outdoor scenes, a high number of simultaneous subjects (up to 35), and comprehensive annotations (SMPL pose + shape, 2D pose, semantic masks, interactions, demographic data). In contrast, datasets such as WorldPose capture large crowds but only from a top‑down static view, limiting their applicability to ground‑level robot perception.
The authors evaluate several state‑of‑the‑art multi‑person pose estimation, tracking, and action‑aware prediction models on JRDB‑Pose3D. Results show a marked performance drop under heavy occlusion and when shape information is ignored, highlighting the importance of shape‑aware and interaction‑aware modeling for real‑world robotics.
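The performance drop under occlusion is typically quantified with the standard MPJPE metric (mean per‑joint position error), optionally restricted to occluded or visible joints via a mask. The sketch below assumes this standard metric; the benchmark's exact evaluation protocol is not specified in the summary.

```python
import numpy as np


def mpjpe(pred, gt, mask=None):
    """Mean per-joint position error between predicted and ground-truth
    3D joints, both of shape (N, J, 3). An optional boolean mask of shape
    (N, J) restricts the average, e.g. to occluded joints only."""
    err = np.linalg.norm(pred - gt, axis=-1)  # (N, J) per-joint distances
    if mask is not None:
        err = err[mask]
    return float(err.mean())


# Example: a constant (3, 4, 0) offset on every joint gives error 5 per joint.
gt = np.zeros((2, 24, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
print(mpjpe(pred, gt))  # → 5.0
```

Evaluating the same predictions twice, once masked to occluded joints and once to visible ones, exposes exactly the occlusion‑induced gap the authors report.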
Limitations include the manual verification frequency (every 15th frame), which may leave occasional annotation errors in the intervening frames, and potential lidar noise in outdoor scenes that can affect global pose alignment. Nevertheless, JRDB‑Pose3D offers the most realistic benchmark to date for robot perception of crowded human environments, paving the way for advances in multi‑modal learning, physical simulation, and socially aware robot navigation.