FR-GESTURE: An RGBD Dataset For Gesture-based Human-Robot Interaction In First Responder Operations


The ever-increasing intensity and frequency of disasters make the work of First Responders (FRs) even more difficult. Artificial intelligence and robotics solutions could facilitate their operations, compensating for these difficulties. To this end, we propose a dataset for gesture-based UGV control by FRs, introducing a set of 12 commands that draws inspiration from existing FR gestures and tactical hand signals and was refined after incorporating feedback from experienced FRs. We then proceed with the data collection itself, resulting in 3312 RGBD pairs captured from 2 viewpoints and 7 distances. To the best of our knowledge, this is the first dataset specifically intended for gesture-based UGV guidance by FRs. Finally, we define evaluation protocols for our RGBD dataset, termed FR-GESTURE, and perform baseline experiments, which are put forward for improvement. We have made the data publicly available to promote future research in the domain: https://doi.org/10.5281/zenodo.18131333.


💡 Research Summary

The paper introduces FR‑GESTURE, a novel RGB‑D dataset specifically designed for gesture‑based control of unmanned ground vehicles (UGVs) in first‑responder (FR) operations. Recognizing the growing complexity of disaster scenarios and the need for intuitive human‑robot interaction, the authors first compiled a set of twelve hand signals that map directly to practical UGV commands: “come to me,” “need help,” “stop,” “emergency,” “move away,” “ok to go,” “evacuate,” “lost connection,” “operation finished,” and three fetch commands (shovel, axe, gas mask). These gestures were derived from existing FR hand signals and tactical signals, then refined through iterative feedback from seven experienced first responders, ensuring operational relevance and ergonomic feasibility.
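The twelve-command set above can be captured as a simple class-index-to-command mapping, which is how a recognition pipeline would typically translate a classifier's prediction into a UGV action. The index ordering and string identifiers below are assumptions for illustration, not the dataset's official labels:

```python
# Hypothetical mapping of FR-GESTURE class indices to UGV commands.
# Index order and identifiers are illustrative; consult the released
# dataset's metadata for the official labels.
FR_GESTURE_COMMANDS = {
    0: "come_to_me",
    1: "need_help",
    2: "stop",
    3: "emergency",
    4: "move_away",
    5: "ok_to_go",
    6: "evacuate",
    7: "lost_connection",
    8: "operation_finished",
    9: "fetch_shovel",
    10: "fetch_axe",
    11: "fetch_gas_mask",
}


def command_for(class_index: int) -> str:
    """Return the UGV command string for a predicted gesture class."""
    return FR_GESTURE_COMMANDS[class_index]
```

Keeping the mapping in one place makes it easy to verify that the classifier's output dimension matches the twelve-class command set.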

Data acquisition employed two Intel RealSense D415 cameras positioned at different heights and angles to capture diverse viewpoints. Each of the seven participants performed every gesture at six to seven distances ranging from approximately 1 m to 7 m, across three distinct scenes (one outdoor, two indoor). This systematic variation yields 3,312 RGB‑D image pairs (480 × 640 resolution, stored as PNG), with each gesture class equally represented to avoid class imbalance. The collection process also deliberately included partially occluded frames and motion‑blurred samples to emulate real‑world “in‑the‑wild” conditions. Metadata—including subject ID, distance index, scene identifier, and camera viewpoint—was logged in CSV files, and a supervisory review removed duplicate or erroneous captures.
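A consumer of the dataset would start by parsing the per-sample metadata and pairing each RGB image with its depth counterpart. The sketch below assumes hypothetical column names (`subject`, `scene`, `distance`, `viewpoint`, `gesture`, `rgb_path`, `depth_path`); the released CSV files may use different headers:

```python
import csv
import io


def load_metadata(csv_text: str) -> list[dict]:
    """Parse FR-GESTURE-style metadata rows into a list of dicts.

    The column names used here are assumptions for illustration;
    check the dataset's own CSV headers before relying on them.
    """
    return list(csv.DictReader(io.StringIO(csv_text)))


def pairs_for_subject(rows: list[dict], subject_id: str) -> list[tuple[str, str]]:
    """Collect (rgb_path, depth_path) pairs recorded for one subject."""
    return [
        (row["rgb_path"], row["depth_path"])
        for row in rows
        if row["subject"] == subject_id
    ]
```

Grouping pairs by subject in this way is also the natural building block for the subject-independent evaluation protocol described below.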

Two evaluation protocols are defined: (1) a “Uniform” protocol that randomly shuffles all samples and applies 5‑fold cross‑validation, measuring overall performance; and (2) a “Subject‑Independent” protocol that holds out one participant for testing while training on the remaining six, thereby assessing generalization to unseen users. These protocols provide complementary insights into both overall accuracy and user‑independent robustness.
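The two protocols can be sketched as plain split-generation routines; the function names and the choice of a fixed shuffle seed are mine, not the paper's:

```python
import random


def uniform_folds(samples: list, k: int = 5, seed: int = 0) -> list[list]:
    """'Uniform' protocol: shuffle all samples, then split into k folds
    for cross-validation."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def subject_independent_splits(samples: list, subject_of):
    """'Subject-Independent' protocol: for each subject, hold out that
    subject's samples for testing and train on everyone else.

    `subject_of` maps a sample to its subject identifier.
    """
    subjects = sorted({subject_of(s) for s in samples})
    for held_out in subjects:
        train = [s for s in samples if subject_of(s) != held_out]
        test = [s for s in samples if subject_of(s) == held_out]
        yield held_out, train, test
```

With seven participants, the subject-independent routine yields seven train/test splits, each testing generalization to a user the model has never seen.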

Baseline experiments employed three well‑known 2‑D convolutional networks: ResNet‑18, ResNet‑50, and ResNeXt‑50. The authors experimented with three input configurations: RGB only, depth only, and a simple concatenation of RGB and depth channels. Under the Uniform protocol, the best model (ResNeXt‑50 with RGB‑D input) achieved 92.3 % classification accuracy, while the Subject‑Independent protocol yielded a lower but still respectable 84.7 % accuracy, highlighting the challenge of user variability. Depth information consistently improved performance, confirming the value of multimodal sensing for gesture recognition in variable lighting and background conditions.
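The "simple concatenation of RGB and depth channels" can be illustrated as stacking the depth map onto the RGB image as a fourth channel before feeding a CNN. The millimetre depth scale and the max-depth normalization constant below are assumptions for illustration, not values from the paper:

```python
import numpy as np


def rgbd_tensor(rgb: np.ndarray, depth: np.ndarray,
                max_depth_mm: float = 10_000.0) -> np.ndarray:
    """Concatenate an RGB image (H, W, 3, uint8) with a depth map
    (H, W, uint16, assumed millimetres) into a (4, H, W) float32 array.

    Normalization choices here are illustrative assumptions; the paper
    does not specify its preprocessing constants.
    """
    rgb_f = rgb.astype(np.float32) / 255.0                 # RGB scaled to [0, 1]
    depth_f = np.clip(depth.astype(np.float32) / max_depth_mm, 0.0, 1.0)
    stacked = np.concatenate([rgb_f, depth_f[..., None]], axis=-1)  # (H, W, 4)
    return np.transpose(stacked, (2, 0, 1))                # channels-first for a CNN
```

Note that a four-channel input also requires widening the first convolutional layer of an off-the-shelf ResNet or ResNeXt, since those networks expect three input channels by default.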

The paper situates FR‑GESTURE within the broader landscape of gesture‑based robot control datasets. Existing collections such as URGR, MD‑UHGRD, and LRHG either target UAV navigation, rely on a limited set of commands, or lack public availability. FR‑GESTURE distinguishes itself by (a) focusing on ground‑vehicle control for first‑responders, (b) offering a richer command set (12 classes), (c) providing synchronized RGB‑D data, and (d) releasing the full dataset with comprehensive metadata.

Limitations are acknowledged: the participant pool is modest (seven subjects), potentially restricting demographic diversity; only static gestures are captured, leaving continuous or dynamic gesture recognition for future work; and the cameras were fixed, whereas a real UGV would carry a moving camera, introducing additional viewpoint dynamics. The authors suggest expanding the participant base, incorporating dynamic gestures, and collecting data with on‑board robot cameras as next steps.

In summary, FR‑GESTURE constitutes the first publicly available, FR‑oriented RGB‑D gesture dataset with a well‑defined command mapping, multi‑distance and multi‑environment coverage, and clear evaluation protocols. It offers a solid benchmark for researchers developing lightweight, robust vision‑based HRI systems, multimodal deep learning models, and real‑time gesture recognition pipelines tailored to disaster‑response robotics.

