Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation


In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.


💡 Research Summary

The paper tackles the fundamental challenge of acquiring high‑quality visual attention data for robot‑assisted minimally invasive surgery (RMIS) training. Expert gaze is costly to collect, and it is unclear whether gaze recorded during active task execution (active gaze) or during passive video viewing (passive gaze) can serve as a viable supervisory signal for learning‑based perception models. To answer this, the authors built a paired active‑passive dataset on the da Vinci SimNow simulator across four representative drills (Sea Spikes, Ring Rollercoaster, Knot‑tying, and Big Dipper needle driving). Active gaze was captured with a Varjo Aero head‑mounted display (200 Hz eye‑tracking) while participants performed the tasks, and the same stereo video streams were later presented on a monitor while gaze was recorded with a Gazepoint GP3 HD tracker (150 Hz) for passive viewing.

Participants were divided into two skill levels: novices (12 first‑year medical students with <1 hour prior exposure) and intermediates (three research team members who met a performance threshold of ≥95 on each drill). The design yields four conditions: Intermediate‑Active (IA), Intermediate‑Passive (IP), Novice‑Active (NA), and Novice‑Passive (NP). Fixations were extracted using the I‑DT dispersion‑threshold algorithm, converted into Gaussian heatmaps, and down‑sampled to 160 × 128 for model input.
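The fixation pipeline described above can be sketched in Python. This is a minimal illustration of I-DT dispersion-threshold detection and Gaussian heatmap rendering, not the authors' implementation: the dispersion threshold, minimum window length, and heatmap sigma below are placeholder values, and `min_duration` is expressed as a sample count rather than milliseconds.

```python
import numpy as np

def idt_fixations(gaze, dispersion_thresh=30.0, min_duration=5):
    """Detect fixations with the I-DT dispersion-threshold algorithm.

    gaze: (N, 2) array of gaze points in pixels. The real pipeline would
    use timestamps; here min_duration is a sample count (assumption).
    Returns a list of (start, end) index pairs, end exclusive.
    """
    fixations = []
    start, n = 0, len(gaze)
    while start < n - min_duration + 1:
        end = start + min_duration
        window = gaze[start:end]
        # Dispersion = (max_x - min_x) + (max_y - min_y)
        disp = np.ptp(window[:, 0]) + np.ptp(window[:, 1])
        if disp <= dispersion_thresh:
            # Grow the window while dispersion stays under the threshold
            while end < n:
                window = gaze[start:end + 1]
                if np.ptp(window[:, 0]) + np.ptp(window[:, 1]) > dispersion_thresh:
                    break
                end += 1
            fixations.append((start, end))
            start = end
        else:
            start += 1
    return fixations

def gaussian_heatmap(points, shape=(128, 160), sigma=4.0):
    """Render fixation centers (x, y) as a normalized Gaussian density map
    at the 160 x 128 model-input resolution used in the paper."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    heat = np.zeros(shape)
    for x, y in points:
        heat += np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
    s = heat.sum()
    return heat / s if s > 0 else heat
```

The grow-then-collapse structure is the standard I-DT formulation: a candidate window is accepted only if its spatial dispersion stays below the threshold for at least the minimum duration, then extended greedily.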

The authors evaluated gaze organization with a suite of metrics: fixation count and duration, fixation‑to‑non‑fixation ratio, scan‑path speed (center‑based), convex‑hull area, and two density‑map similarity measures (histogram intersection FDM‑SIM and Pearson correlation FDM‑CC). These metrics capture both temporal dynamics and spatial spread, allowing a nuanced comparison of skill‑ and modality‑dependent gaze patterns.
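The two density-map measures are standard saliency-evaluation metrics; a minimal numpy sketch of both, with both maps assumed to be non-negative 2D arrays (function names mirror the paper's abbreviations, but the exact normalization used by the authors is an assumption):

```python
import numpy as np

def fdm_sim(p, q):
    """FDM-SIM: histogram intersection between two fixation density maps.
    Each map is normalized to sum to 1, then the pixel-wise minima are summed;
    1.0 means identical distributions, 0.0 means disjoint support."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())

def fdm_cc(p, q):
    """FDM-CC: Pearson correlation between two fixation density maps.
    Maps are z-scored, so the result is in [-1, 1]."""
    p = (p - p.mean()) / p.std()
    q = (q - q.mean()) / q.std()
    return float((p * q).mean())
```

SIM is sensitive to missing density mass, while CC rewards matching the overall spatial pattern, which is why the two are typically reported together.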

For modeling, two single‑frame saliency approaches were compared: MSI‑Net, a supervised CNN encoder‑decoder that predicts spatial attention maps, and SalGAN, a generative adversarial network that encourages realistic saliency map distributions. MSI‑Net proved more stable across all data splits, while SalGAN exhibited training instability and poorer alignment with human fixations, especially when trained on passive data.
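The summary does not state the training objectives, but supervised saliency predictors like MSI-Net are commonly trained with distribution-based losses such as the KL divergence between predicted and ground-truth density maps. A hedged numpy sketch of that loss (an assumed formulation, not necessarily the authors' exact objective):

```python
import numpy as np

def kld_loss(pred, target, eps=1e-7):
    """KL divergence from the target fixation density to the prediction.
    Both maps are renormalized to proper distributions; eps guards the log.
    Lower is better; 0 means the distributions match."""
    p = pred / (pred.sum() + eps)
    t = target / (target.sum() + eps)
    return float((t * np.log(eps + t / (p + eps))).sum())
```

Because KL is asymmetric and penalizes predictions that assign near-zero mass where humans fixate, it pairs naturally with the deterministic encoder-decoder setup; SalGAN instead adds an adversarial term, which is one plausible source of the instability reported here.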

Key experimental findings:

  1. Passive gaze can substitute for a large portion of active gaze supervision. Models trained on IP data recovered ≈70% of IA performance, indicating that passive labels retain much of the expert's visual strategy.


  2. Transfer is asymmetric. Predicting active fixations from passive‑trained models yields lower performance than predicting passive fixations from active‑trained models, reflecting the tighter coupling of active gaze to motor planning and error monitoring.

  3. Novice‑passive labels approximate intermediate‑passive labels. Despite being collected from less skilled observers, NP data still provide useful supervision for intermediate‑level models, enabling a scalable, crowd‑sourced pipeline with modest loss in quality.

  4. Model choice matters. MSI‑Net’s deterministic, encoder‑decoder architecture offers consistent, interpretable predictions suitable for real‑time surgical assistance, whereas SalGAN’s adversarial training is sensitive to data quantity and quality, leading to noisy outputs.

The authors also detail a careful data‑centric split strategy: each of the four conditions was partitioned into training, validation, and test sets per task, preserving diversity of demonstrations and error types. Demonstrations were ranked by a composite performance index (score, time, penalties) to ensure balanced sampling across skill levels.
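The composite performance index is not given in closed form here; the following is a hypothetical ranking along those lines, with made-up weights and demonstration records purely for illustration (higher score is better, longer time and more penalties are worse):

```python
def composite_index(score, time_s, penalties, weights=(1.0, -0.01, -1.0)):
    """Hypothetical composite performance index combining simulator score,
    completion time (s), and penalty count. Weights are illustrative only."""
    ws, wt, wp = weights
    return ws * score + wt * time_s + wp * penalties

# Illustrative demonstrations (not data from the paper)
demos = [
    {"id": "d1", "score": 96, "time_s": 120, "penalties": 0},
    {"id": "d2", "score": 98, "time_s": 300, "penalties": 2},
    {"id": "d3", "score": 90, "time_s": 100, "penalties": 1},
]

# Rank demonstrations from best to worst composite index, as a basis for
# balanced sampling into train/validation/test splits per condition and task.
ranked = sorted(
    demos,
    key=lambda d: composite_index(d["score"], d["time_s"], d["penalties"]),
    reverse=True,
)
```

Ranking before splitting lets each split draw from the full quality range, which is the data-centric point the authors emphasize: split balance matters as much as raw data volume.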

Overall, the study demonstrates that high‑fidelity surgical gaze models do not strictly require costly active recordings from experts. Passive gaze, especially from intermediate‑skill observers, can serve as an effective supervisory signal, opening the door to large‑scale, crowd‑sourced gaze collection for surgical coaching, automated skill assessment, and perception‑driven robot control. The work also underscores the importance of data‑centric design—careful curation, balanced splits, and appropriate metric selection—in building robust, generalizable visual attention models for medical robotics.

