A labeled dataset of simulated phlebotomy procedures for medical AI: polygon annotations for object detection and human-object interaction
This data article presents a dataset of 11,884 labeled images documenting a simulated blood extraction (phlebotomy) procedure performed on a training arm. Images were extracted from high-definition videos recorded under controlled conditions and curated to reduce redundancy using Structural Similarity Index Measure (SSIM) filtering. An automated face-anonymization step was applied to all videos prior to frame selection. Each image contains polygon annotations for five medically relevant classes: syringe, rubber band, disinfectant wipe, gloves, and training arm. The annotations were exported in a segmentation format compatible with modern object detection frameworks (e.g., YOLOv8), ensuring broad usability. This dataset is partitioned into training (70%), validation (15%), and test (15%) subsets and is designed to advance research in medical training automation and human-object interaction. It enables multiple applications, including phlebotomy tool detection, procedural step recognition, workflow analysis, conformance checking, and the development of educational systems that provide structured feedback to medical trainees. The data and accompanying label files are publicly available on Zenodo.
💡 Research Summary
The paper introduces a curated computer‑vision dataset designed to accelerate research on automated medical training, specifically for phlebotomy—a critical skill for safe blood collection. The authors recorded high‑definition (1920 × 1080 px, 30 fps) videos of a standardized simulated phlebotomy procedure performed on a training arm. Recordings were made with a static tripod camera under controlled lighting, with two slightly different viewpoints and varying ambient illumination to introduce modest visual variability.
From these videos, 11,884 frames were selected using a Structural Similarity Index Measure (SSIM) filter: each candidate frame was compared to the most recently kept frame and retained only if SSIM < 0.95, thereby eliminating near‑duplicate images while preserving diverse visual contexts. Prior to frame extraction, all videos underwent automatic face detection and Gaussian blurring via the Python face_recognition library and OpenCV, ensuring that any incidental human faces were anonymized in compliance with privacy regulations.
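The compare-to-last-kept selection logic can be sketched in plain Python. Note this uses a simplified *global* SSIM over flattened grayscale pixel values for illustration; the authors likely used a windowed implementation such as scikit-image's `structural_similarity`. The 0.95 threshold follows the text.

```python
from statistics import mean

# Stabilization constants from the standard SSIM definition (dynamic range L = 255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2

def ssim(a, b):
    """Simplified global SSIM between two flattened grayscale frames."""
    mu_a, mu_b = mean(a), mean(b)
    var_a = mean((x - mu_a) ** 2 for x in a)
    var_b = mean((y - mu_b) ** 2 for y in b)
    cov = mean((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / (
        (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    )

def filter_frames(frames, threshold=0.95):
    """Keep a frame only if its SSIM to the most recently kept frame is below threshold."""
    kept = [frames[0]]
    for frame in frames[1:]:
        if ssim(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept
```

Comparing against the last *kept* frame, rather than the immediately preceding one, prevents slow drift across many near-identical frames from evading the filter.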
Each retained image is annotated with polygonal segmentation masks for five medically relevant object classes: (0) disinfectant wipe, (1) gloves, (2) rubber band, (3) syringe, and (4) training arm. Annotation was performed using Roboflow 3.0. A “golden set” of 8,743 images was manually labeled by an expert; this set was used to train an initial segmentation model, which then auto‑labeled the remaining images. All auto‑generated masks were manually verified and corrected, and a random subset was cross‑checked by a second annotator to ensure consistency. The final annotation format follows the YOLOv8 segmentation specification: each line contains a class ID followed by a sequence of normalized (x, y) polygon vertices.
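As a concrete illustration, a YOLOv8 segmentation label line can be parsed as follows; the sample line is hypothetical, while the class IDs match the mapping given above:

```python
CLASS_NAMES = ["disinfectant wipe", "gloves", "rubber band", "syringe", "training arm"]

def parse_label_line(line):
    """Parse one YOLOv8 segmentation label line:
    '<class_id> x1 y1 x2 y2 ...' with coordinates normalized to [0, 1]."""
    tokens = line.split()
    class_id = int(tokens[0])
    coords = [float(t) for t in tokens[1:]]
    if len(coords) % 2 != 0:
        raise ValueError("polygon needs an even number of coordinates")
    # Pair up alternating x and y values into (x, y) vertices.
    vertices = list(zip(coords[0::2], coords[1::2]))
    return class_id, vertices

# Hypothetical label line: a syringe (class 3) with a three-vertex polygon.
cid, poly = parse_label_line("3 0.12 0.34 0.56 0.78 0.90 0.10")
```

Because coordinates are normalized, the same label file remains valid regardless of the image resolution it is paired with.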
The dataset is organized according to the conventions of the YOLOv8 framework. Images are resized so that the longer side equals 640 px while preserving aspect ratio (resulting in 640 × 480 or 640 × 360 resolutions). No square padding is applied, allowing users to add padding if required by their models. The data are split into training (70%), validation (15%), and test (15%) subsets, each containing matching image and label directories. A data.yaml file lists class names and the relative paths to each split, enabling immediate ingestion by standard training scripts.
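A data.yaml of this kind typically looks like the sketch below. The split directory paths are assumptions based on common YOLOv8 conventions, not copied from the dataset; the class list follows the mapping given earlier.

```yaml
# Hypothetical data.yaml matching the described 70/15/15 split layout.
train: train/images
val: valid/images
test: test/images

nc: 5
names:
  0: disinfectant wipe
  1: gloves
  2: rubber band
  3: syringe
  4: training arm
```

With this file in place, standard YOLOv8 training scripts can locate images, labels, and class names without further configuration.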
Quality assurance was multi‑layered. Visual QA removed blurry or heavily occluded frames and confirmed that each frame contained the expected objects. Label QA verified correct class assignment and tight polygon boundaries (snap‑to‑edge functionality in Roboflow). Additionally, the authors trained a YOLOv8 segmentation model for up to 150 epochs on Apple Silicon (M1 Pro) and monitored four loss components—box_loss, cls_loss, seg_loss, and dfl_loss—as well as performance metrics: precision, recall, mAP@0.5, and mAP@0.5:0.95. All curves stabilized at high values, indicating that the annotations are both accurate and sufficiently diverse for robust model training. A lightweight sanity‑check training on the golden set further confirmed class separability and the absence of systematic labeling errors.
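A lightweight label-QA pass of the kind described can be approximated with a shoelace-area check that flags degenerate polygons; the vertex-count and area thresholds below are illustrative assumptions, not values from the paper.

```python
def polygon_area(vertices):
    """Area of a polygon in normalized coordinates, via the shoelace formula."""
    n = len(vertices)
    acc = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0

def is_suspect(vertices, min_vertices=3, min_area=1e-4):
    """Flag polygons too small or with too few vertices to be plausible masks."""
    return len(vertices) < min_vertices or polygon_area(vertices) < min_area
```

Running such a check over every label file is a cheap complement to visual QA: it catches zero-area or collapsed polygons that are easy to miss by eye.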
Limitations are openly discussed. Because the data were captured in a simulated environment with a static arm and limited background variation, they may not fully represent the visual complexity of real clinical settings (e.g., diverse patient skin tones, varying bedside lighting, multiple camera angles, and unpredictable hand postures). Only five object classes are annotated; additional tools such as blood‑drawing machines, tourniquets, or waste containers are omitted. Temporal annotations indicating procedural steps are also absent, restricting the dataset to frame‑level analysis rather than full sequence modeling. The authors acknowledge a residual risk that face‑blurring could fail in edge cases and recommend re‑assessment if original videos are ever reused.
Future releases aim to address these gaps by incorporating multiple camera viewpoints, a broader demographic of trainees, additional object classes, and temporal step labels, thereby enhancing the dataset’s applicability to real‑world clinical AI systems.
Overall, the dataset provides high‑quality polygonal segmentation masks for a well‑defined set of phlebotomy‑related objects, packaged in a ready‑to‑use YOLOv8 format. Its public availability on Zenodo under a CC BY 4.0 license, together with comprehensive documentation (README, example config, and metadata files), lowers the entry barrier for researchers developing object detection, human‑object interaction, workflow analysis, and automated feedback systems in medical training. The work fills a notable gap in publicly accessible medical‑procedure vision data and offers a solid foundation for both benchmark studies and downstream applications such as AR/VR tutoring, conformance checking, and multimodal IoT‑augmented monitoring.