Automated Assessment of Kidney Ureteroscopy Exploration for Training

Fangjie Li 1*, Nicholas Kavoussi 2, Charan Mohan 2, Matthieu Chabanas 1, Jie Ying Wu 1

1 Department of Computer Science, Vanderbilt University, 2301 Vanderbilt Place, Nashville, TN 37235, USA.
2 Department of Urology, Vanderbilt University Medical Center, 1211 Medical Center Drive, Nashville, TN 37232, USA.
* Corresponding author. E-mail: fangjie.li@vanderbilt.edu

Abstract

Purpose: Kidney ureteroscopic navigation is challenging, with a steep learning curve. However, current clinical training has major deficiencies, as it requires one-on-one feedback from experts and occurs in the operating room (OR). Therefore, there is a need for a phantom training system with automated feedback to greatly expand training opportunities.

Methods: We propose a novel, purely ureteroscope-video-based scope localization framework that automatically identifies calyces missed by the trainee in a phantom kidney exploration. We use a slow, thorough, prior exploration video of the kidney to generate a reference reconstruction. This reference reconstruction can then be used to localize any exploration video of the same phantom.

Results: In 15 exploration videos, a total of 69 out of 74 calyces were correctly classified. We achieve < 4 mm camera pose localization error. Given the reference reconstruction, the system takes 10 minutes to generate the results for a typical exploration (1-2 minutes long).

Conclusion: We demonstrate a novel camera localization framework that can provide accurate and automatic feedback for kidney phantom explorations. We show its ability as a valid tool that enables out-of-OR training without requiring supervision from an expert.
Keywords: Ureteroscopy, Endoscopy, Structure from Motion, Localization, Surgical Training

1 Introduction

In ureteroscopic kidney stone removal operations, up to 20% of patients require a second operation due to missed stones [1]. This is partly due to the challenging nature of navigating through the kidney collecting system [2], which requires precise endoscopic manipulation and knowledge of the kidney's anatomy to ensure that every kidney cavity, called a calyx, is fully visited.

Accurately navigating the kidney has a steep learning curve. Unfortunately, training opportunities are limited because the current training paradigm relies on one-on-one, apprenticeship-style guidance during operating room (OR) cases, which are subject to significant time and safety constraints [3]. Additionally, trainees often only receive limited verbal feedback at the end of a case [4], based primarily on an expert's subjective judgment. In this work, we aim to improve ureteroscopy training by introducing an automated, objective feedback mechanism for kidney phantom exploration.

In prior work [5], we introduced anatomically accurate phantoms that can be used for training outside of the OR. However, training on these phantoms still requires one-on-one guidance from an expert. Although the electromagnetic tracking used in [5] can provide automated feedback on exploration completeness, it also increases hardware cost and complexity. Automatic assessment of exploration completeness without additional hardware may facilitate the adoption of ureteroscopy training on phantoms.

Computer-vision-based methods such as Simultaneous Localization and Mapping (SLAM) and Structure from Motion (SfM) show great promise in reconstructing the anatomical scene and localizing scope poses through video input alone [6, 7]. However, their performance is susceptible to the quality of the exploration videos [8].
This leads to frequent failures when using these algorithms on poor-quality exploration videos with high motion blur, such as those generated by trainees learning to use the endoscope.

In this paper, we propose a novel, purely ureteroscope-video-based framework that measures a trainee's exploration coverage of a kidney, providing automatic feedback by identifying the calyces missed by the trainee. The framework requires no additional hardware other than a consumer-grade computer. We overcome the challenges of existing methods for ureteroscope-based reconstruction by using a slow and thorough reference exploration of the kidney phantom to build a reference 3D reconstruction. This reconstruction acts as a reusable prior, simplifying the pose localization problem for the challenging normal-speed query exploration videos recorded by trainees. The reconstruction can be reused for any exploration video of the same phantom. The code base will be publicly available upon acceptance of the paper.

2 Related Work

Simulation Models for Ureteroscopic Training: Traditionally, surgical training followed an apprenticeship model, but simulation-based training is gaining traction because it adds training opportunities, provides objective performance metrics, and has demonstrated educational effectiveness [9]. Physical bench-top models exist for urology applications, such as the Uro-Scopic Trainer (Limbs & Things Ltd., Bristol, UK), the Scope Trainer (Mediskills Ltd., Edinburgh, UK), and the adult ureteroscopy trainer (Ideal Anatomic Modelling, MI, USA). They mimic the anatomical structure and texture with high fidelity. Their realism has been confirmed through clinical user studies, and users' scores in these systems correlate with their experience [10-12]. Despite their realism, none of these physical models provides automatic feedback on task performance.
3D Reconstruction in Surgical Applications: 3D reconstruction and camera position tracking are progressing quickly in medical research due to their ability to provide intraoperative information about the surgical scene and improve guidance precision [7, 13]. In colonoscopy, such reconstruction algorithms have been developed to ensure full exploration coverage [13], similar to our goal in ureteroscopy. 3D reconstruction in ureteroscopy is relatively underexplored. Oliva Maza et al. [6] adapted a SLAM algorithm for ureteroscopy applications, adding image preprocessing and adopting a new image feature detection algorithm. Nonetheless, the system can still lose track under rapid motions. Acar et al. [14] performed SfM reconstruction of patient and CT-rendered ureteroscopy using different methods, of which hloc [15] produced the most robust results. Overall, endoscope images with poor image quality, such as motion blur, present a significant challenge for these systems to function robustly [8].

3 Methodology

3.1 Framework Description

Our framework consists of two stages (Fig. 1). In the first stage, a reference 3D reconstruction of the phantom's collecting system is generated with an SfM algorithm and two slow and thorough reference exploration videos [14, 16]. The reference reconstruction consists of a 3D point cloud of the kidney collecting system and the endoscope poses of the slow reference exploration frames localized within the point cloud. The reconstruction is then manually registered to the computed tomography (CT) segmentation of the phantom. In the second stage, the framework receives a normal-speed query video of a phantom exploration from each trainee. The framework localizes the query video frames against the reference reconstruction from stage one. This allows the framework to classify kidney calyces as visited or missed.
The framework then displays localization information to the user via an annotated CT segmentation. For any phantom, the reference reconstruction (and hence, the slow exploration) from stage one can be reused for localizing any number of exploration videos in stage two. Additionally, this two-stage approach allows for the calyx-level localization of videos that cannot be reconstructed on their own due to their limited quality.

Fig. 1 The overall workflow of the framework. In stage one, a reference reconstruction is generated and registered to the CT segmentation. This reconstruction is then reused for all query exploration videos, which are localized in stage two. For each video, calyces are marked as visited/not visited.

Stage 1: Reference Model Generation: For each phantom, we use two slow, thorough explorations as reference videos of the kidney phantom's collecting system. Because we use an SfM pipeline invariant to image order, we simply concatenate the frames from the two videos. This results in a total of 1500-2300 frames per phantom, from which we generate a reference reconstruction of the collecting system. To ensure maximum coverage and high image quality, we ask experts to perform the explorations. We employ an SfM pipeline based on the hloc toolkit because, in our prior endoscopic reconstruction work [14, 16], hloc met our robustness needs and compared favorably to alternative SfM pipelines on similar data. In principle, any SfM/SLAM approach that is reliable for the slow videos could be sufficient for our framework. The pipeline uses NetVLAD [17] for covisible image retrieval, ALIKED [18] for feature detection, LightGlue for feature matching, and COLMAP [19] for multi-view 3D reconstruction. The phantom is scanned with computed tomography (CT) and manually segmented by the authors. The reference reconstruction is registered to the CT segmentation using Iterative Closest Point (ICP) with manual initialization. The individual calyces in the CT segmentation are also manually annotated to enable per-calyx visitation classification. The renal pelvis and ureter are not annotated, as exploration in those regions is trivial. The slow videos are required only once per phantom at this stage to generate a reference reconstruction, and all query videos of the same phantom can reuse the same reference reconstruction.

Stage 2a: Query Video Localization: Using the reference reconstruction and CT segmentation, we can localize challenging, normal-speed query exploration videos of the same phantom. For each query frame, a two-stage image retrieval process is performed. First, NetVLAD retrieves candidate covisible reference images. Second, to improve localization robustness, we use a local-feature-based match filter to remove falsely matched reference images. Feature extraction and matching are performed with ALIKED and LightGlue. Outlier matches are filtered through RANSAC-based essential matrix estimation, and reference frames with inlier match counts or inlier ratios below thresholds are rejected. The remaining high-confidence reference frame matches are then used to determine the camera pose associated with the query frame. Afterwards, we apply further spatio-temporal consistency filtering to remove any incorrectly localized frames. First, all query frames that are localized outside of the segmentation mesh are rejected.
Second, query frames are filtered based on their distance to the last localized frame. We set a dynamic distance threshold based on the time elapsed since the last localized frame (e.g., frames further separated in time may be further separated in distance).

Stage 2b: Calyx Visit Score Computation: For each localized query frame, we obtain its 6 degrees-of-freedom (DoF) pose localized against the CT segmentation. Combined with the intrinsic parameters of the ureteroscope, we render the view from the estimated camera pose within the CT segmentation mesh (Fig. 3). We then mark the vertices of the CT segmentation visible from the localized pose through ray casting. Given the entire query video, we aggregate all vertices that were viewed. We then compute the visitation score for each calyx individually. For each calyx, we determine the total number of vertices belonging to that calyx and the subset of those vertices that were viewed. The visitation score is then defined as the ratio of viewed vertices to the total number of vertices in the calyx. We then set a threshold, VS_thd, where calyces with a score higher than VS_thd are considered thoroughly visited. Overall, this produces a binary classification for all segmented calyces in the phantom, where each calyx is marked as visited or missed. The main renal pelvis and entrance are not marked, as they represent trivial exploration targets.

Parameters and Filter Threshold Settings: All parameters in the framework are set globally (i.e., not per model or per video), based on heuristics developed in prior studies and cross-validation, and fixed prior to evaluation.

In Stage 2a, the RANSAC filter parameters for rejecting invalid image and feature matches were set heuristically based on our prior experience with feature-based endoscopic reconstruction [14, 16] and a small subset of frames from three videos.
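As a rough illustration of this dynamic distance threshold, the spatio-temporal filter can be sketched in a few lines. This is a minimal sketch under our own assumptions: the 135 mm/s speed cap matches the setting reported in the parameter section, but the loop structure, function name, and variable names are ours, not the authors' implementation.

```python
import numpy as np

def temporal_filter(positions_mm, times_s, max_speed_mm_s=135.0):
    """Reject localized frames that moved implausibly fast.

    The allowed displacement from the last accepted frame grows with
    the elapsed time, capped by an assumed maximum scope speed.
    Returns the indices of the frames that pass the filter.
    """
    keep = [0]  # assume the first localized frame is valid
    for i in range(1, len(positions_mm)):
        last = keep[-1]
        dt = times_s[i] - times_s[last]            # time since last accepted frame
        dist = np.linalg.norm(positions_mm[i] - positions_mm[last])
        if dist <= max_speed_mm_s * dt:            # dynamic distance threshold
            keep.append(i)
    return keep

# toy trajectory at 30 FPS: frame 2 teleports 50 mm in one frame interval
t = np.arange(5) / 30.0
p = np.array([[0, 0, 0], [1, 0, 0], [51, 0, 0], [2, 0, 0], [3, 0, 0]], float)
print(temporal_filter(p, t))  # frame 2 is rejected
```

Note that because the threshold is relative to the last accepted frame, a frame rejected as an outlier does not reset the trajectory: later frames are still compared against the last plausible pose.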
Such RANSAC-based filtering is standard practice in SfM pipelines. These parameters were then fixed and applied uniformly to all other videos to avoid overfitting. Also in Stage 2a, a distance-based filter is applied to remove incorrectly localized poses. We first selected one of the three videos mentioned above and estimated the scope velocity based on the reconstruction. We then set the filter threshold heuristically and conservatively at approximately 135 mm/s, well above normal exploration speeds for our phantoms, which are no longer than 150 mm in length.

For the visitation threshold VS_thd in stage 2b, which controls the final output accuracy, we performed 5-fold cross-validation on the 15 trainee videos, with each fold containing three randomly sampled videos. For each fold f, visitation scores were computed for all calyces in the train-set videos. Based on manual expert annotations, calyces were divided into visited and unvisited sets, yielding two score distributions: {S^f_visited} and {S^f_non_visited}. To best separate the two classes, the visitation threshold for fold f is defined as the midpoint between the mean scores of the two sets:

    VS^f_thd = ( mean({S^f_visited}) + mean({S^f_non_visited}) ) / 2        (1)

This choice provides a simple, unbiased decision boundary between the two score distributions. To improve robustness, the 5-fold cross-validation was repeated five times with different random seeds.

3.2 Experiments

Video Data Collection: We used anatomically accurate silicone phantoms based on patient CTs of kidneys with normal anatomy. The phantom CTs are easy to segment, as the phantoms are homogeneous in material. These phantoms were fabricated following the method described in [5], with BegoStone artificial kidney stones (Bego, USA) inserted. In total, four phantoms were included, and we collected data in two separate experimental sessions.
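The per-calyx scoring of Stage 2b and the fold-wise threshold of Eq. (1) reduce to a few lines of array arithmetic. The sketch below is our own illustration under assumed data layouts (a vertex-to-calyx label array and a boolean per-vertex visibility mask), not the authors' code:

```python
import numpy as np

def visitation_scores(calyx_ids, viewed_mask):
    """Per-calyx visitation score (Stage 2b): fraction of a calyx's
    mesh vertices seen from any localized frame.

    calyx_ids[i]  -- integer calyx label of vertex i
    viewed_mask[i] -- True if vertex i was visible from >= 1 localized pose
    """
    scores = {}
    for c in np.unique(calyx_ids):
        in_calyx = calyx_ids == c
        scores[int(c)] = viewed_mask[in_calyx].mean()  # viewed / total vertices
    return scores

def fold_threshold(visited_scores, non_visited_scores):
    """Eq. (1): midpoint of the two class-mean scores for one fold."""
    return (np.mean(visited_scores) + np.mean(non_visited_scores)) / 2.0

ids = np.array([0, 0, 0, 0, 1, 1])                     # two calyces
seen = np.array([True, True, True, False, False, False])
s = visitation_scores(ids, seen)                       # {0: 0.75, 1: 0.0}
thd = fold_threshold([0.8, 0.7], [0.1, 0.2])           # midpoint = 0.45
print(s, thd)
```

With a threshold of 0.45, calyx 0 (score 0.75) would be classified as visited and calyx 1 (score 0.0) as missed, mirroring the binary output described above.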
In the first experiment, two expert surgeons performed, for each phantom, two thorough explorations at slow speed and one exploration at normal speed. The slow explorations provide high-quality, comprehensive coverage of the kidney and are used for generating reference reconstructions (stage 1 of the method). The normal-speed explorations serve as validation query videos to quantitatively evaluate stage 2, i.e., that frames collected at normal speed can be localized with respect to the reference reconstruction. For each phantom, one slow and one normal-speed exploration were electromagnetically (EM) tracked to provide ground-truth camera poses for evaluation. An EM sensor was fixed at the tip of the ureteroscope and tracked using the Aurora EM tracking system (Northern Digital Inc., Canada).

In the second experiment, we collected exploration videos conducted at normal speed by four surgical trainees over a multi-month period. Each trainee explored the four phantoms under the guidance of an expert surgeon. For one trainee, we failed to record one exploration, leaving us with 15 trainee query videos. These videos are used as queries and localized in stage 2 of the method. EM tracking was not employed, because attaching the EM sensor noticeably increases the ureteroscope diameter (scope diameter: 2 mm, sensor diameter: 1 mm) and interferes with normal scope handling, reducing the fidelity of the simulated ureteroscopy experience for trainees. The trainees, with two to four years of postgraduate experience, were guided by an expert either verbally (standard clinical practice) or using a previously developed AR-based eye-gaze guidance system [20]. All trainees are residents from the urology department of the Vanderbilt University Medical Center, USA. Videos were recorded at 30 FPS and lasted between 1 and 2 minutes. A frame stride of 2 was applied. Scope camera intrinsics were calibrated using a ChArUco board.
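The calibrated intrinsics are what later allow CT views to be rendered from estimated poses (Stage 2b). As a simplified illustration, projecting CT-mesh vertices through a pinhole model with an intrinsic matrix K looks like the sketch below; the K values, image size, and function are hypothetical, and the ray-casting occlusion test used by the framework is deliberately omitted:

```python
import numpy as np

def project_vertices(verts_mm, K, R, t, width, height):
    """Project 3D mesh vertices into the image with a pinhole model.

    Returns a boolean mask of vertices that land inside the image and
    lie in front of the camera. Occlusion (ray casting against the
    mesh) is not handled in this sketch.
    """
    cam = R @ verts_mm.T + t[:, None]                   # world -> camera frame
    in_front = cam[2] > 0                               # positive depth only
    uv = K @ cam                                        # homogeneous pixels
    uv = uv[:2] / np.where(cam[2] == 0, 1.0, cam[2])    # perspective divide
    inside = (uv[0] >= 0) & (uv[0] < width) & (uv[1] >= 0) & (uv[1] < height)
    return in_front & inside

# hypothetical intrinsics; identity pose; 640x480 image
K = np.array([[400.0, 0, 320], [0, 400.0, 240], [0, 0, 1]])
verts = np.array([[0, 0, 10.0], [100, 0, 10.0], [0, 0, -10.0]])
mask = project_vertices(verts, K, np.eye(3), np.zeros(3), 640, 480)
print(mask)  # only the first vertex is visible
```

In the actual framework, vertices passing such a visibility test (plus the occlusion check) would be marked as viewed and accumulated over the whole query video.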
Data was recorded with approval from the hospital's Institutional Review Board (IRB 231997).

System Setup: We used a PC with a dedicated graphics card (RTX 4090) for all computations.

Reference Reconstruction Accuracy: We measure whether the SfM reconstruction point cloud is accurate relative to the CT segmentation, so that camera poses localized against this reconstruction are accurate with respect to the real anatomical shape. After registering the reference point cloud to the CT segmentation, we compute the mean Euclidean distance from the point cloud to the segmentation mesh (single-sided chamfer distance). We also compute the 99th-percentile Hausdorff distance, which highlights outliers in the reconstruction. Additionally, we compute the reconstruction coverage and the reprojection error. The reconstruction coverage is defined as the percentage of CT point cloud points that have a corresponding point in the reference reconstruction within 1 mm. The reprojection error is the pixel-space distance between an observed 2D feature point and the projected location of its corresponding 3D point under the estimated camera pose and intrinsics.

Table 1  Reference Model Point Cloud CT Registration Error

  Phantom ID | Mean Euclidean Distance (mm) | 99% Hausdorff Distance (mm) | Coverage (%) | Reprojection Error (px)
  1          | 1.0 ± 1.8                    | 6.1                         | 58.19        | 1.19
  2          | 1.7 ± 1.8                    | 4.8                         | 43.17        | 1.14
  3          | 1.1 ± 1.9                    | 6.2                         | 63.76        | 1.32
  4          | 1.4 ± 1.6                    | 5.3                         | 66.30        | 1.26

Lastly, we evaluate the accuracy of the SfM pose localizations of the EM-tracked reference video frames. We use 10% of the localized reference frame poses as fiducials to robustly estimate the transformation between the reconstruction and EM tracker coordinate frames. With this transform, we then register the remaining reference frame poses as targets and compute the mean Euclidean distance between the transformed SfM localizations and the EM-tracked ground truth.
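The fiducial-based alignment between the SfM and EM coordinate frames amounts to a least-squares rigid fit between paired point sets. A common closed-form choice is the Kabsch/SVD solution sketched below; this is our illustration, and the paper's "robust" estimate may additionally reject outlier fiducials, which we omit:

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src -> dst.

    src, dst: (N, 3) paired fiducial positions, e.g., SfM camera
    centers and their EM-tracked counterparts.
    Returns R (3x3 rotation) and t (3,) such that dst ~= src @ R.T + t.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)                    # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# synthetic check: recover a known rotation + translation exactly
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 3))
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.0]])      # 90 deg about z
dst = src @ Rz.T + np.array([1.0, 2.0, 3.0])
R, t = rigid_fit(src, dst)
tre = np.linalg.norm(src @ R.T + t - dst, axis=1).mean() # residual error
print(round(tre, 6))  # ~0 for noise-free data
```

With real fiducials, the same residual evaluated on the held-out target poses plays the role of the mean Euclidean error reported in the evaluation.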
Camera Pose Localization Accuracy: For the EM-tracked normal-speed exploration videos from the first experiment, localized using stage 2 of the framework, we quantitatively evaluate pose localization accuracy. The same EM-to-reconstruction transformation described above is applied, and the mean Euclidean distance to the EM-tracked ground-truth poses is computed. EM tracking is not available for trainee explorations; for those, we perform a qualitative, manual assessment of camera localization accuracy. For each frame, we generate a rendered view of the CT segmentation using the camera intrinsics and estimated pose (Fig. 3). We then manually review the CT renders side by side with the actual query frames to confirm that they represent the same anatomical positions.

Visitation Classification Accuracy: A challenge with SfM-based localization is that it can fail to localize frames due to a lack of salient visual features. Therefore, we assess whether the framework can capture the overall exploration path even with missed frames. Two independent reviewers annotated explored and missed calyces for each query video. Where annotations differed, the reviewers discussed and came to an agreement. These binary annotations are compared against the framework's output.

4 Results

Reference Reconstruction Accuracy: As shown in Table 1 and Fig. 2, all phantom reconstructions had mean Euclidean point distances of < 2 mm, with standard deviations also under 2 mm, suggesting good reconstruction accuracy. All reconstructions had 99th-percentile Hausdorff distances below 6.5 mm, indicating only small outliers. Notably, the reconstructions do not cover the entirety of the CT segmentation. Nonetheless, for phantoms 1-3, every calyx had a significant portion of its tubular structure reconstructed (Fig. 2). For phantom 4, the lower-most calyces are only partially reconstructed due to their hard-to-reach geometry.
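The reconstruction-accuracy metrics behind Table 1 (single-sided chamfer distance, 99th-percentile distance, and 1 mm coverage) can be sketched with brute-force nearest neighbors. This is our own point-to-point illustration under assumed inputs; the paper measures distances to the segmentation mesh rather than to a CT point sample:

```python
import numpy as np

def recon_metrics(recon_pts, ct_pts, cov_radius_mm=1.0):
    """Table-1-style metrics, brute force for small point clouds.

    - mean_mm:      mean single-sided distance, reconstruction -> CT
    - p99_mm:       99th-percentile of the same distances (outliers)
    - coverage_pct: % of CT points with a reconstructed point within 1 mm
    """
    # pairwise distance matrix: (n_recon, n_ct)
    d = np.linalg.norm(recon_pts[:, None, :] - ct_pts[None, :, :], axis=2)
    recon_to_ct = d.min(axis=1)      # nearest CT point per reconstructed point
    ct_to_recon = d.min(axis=0)      # nearest reconstructed point per CT point
    return {
        "mean_mm": recon_to_ct.mean(),
        "p99_mm": np.percentile(recon_to_ct, 99),
        "coverage_pct": 100.0 * (ct_to_recon <= cov_radius_mm).mean(),
    }

ct = np.array([[0, 0, 0], [10, 0, 0], [20, 0, 0.0]])      # toy CT sample
recon = np.array([[0.5, 0, 0], [10.2, 0, 0.0]])           # toy reconstruction
m = recon_metrics(recon, ct)
print(m)
```

For large clouds a k-d tree would replace the quadratic distance matrix, but the three metrics themselves stay the same.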
Quantitative Reconstruction and Localization Camera Pose Accuracy: In Table 2, we report the camera position errors of the EM-tracked reference and query videos. All errors are below 4 mm.

Fig. 2 The accuracy of the SfM reconstruction point cloud compared to the CT volume. The violin plot shows the 99th-percentile data, as the outliers, though numerically extreme, have almost no impact on the CT registration and the subsequent pose localization.

Table 2  Camera Pose Localization Accuracy of Expert Recordings

  Phantom ID | Reference Video (EM) Position Error (mm) | Query Video (EM) Position Error (mm)
  1          | 2.7 ± 2.0                                | 2.6 ± 1.6
  2          | 2.9 ± 1.7                                | 3.5 ± 1.9
  3          | 3.3 ± 1.7                                | 3.8 ± 1.7
  4          | 3.0 ± 1.7                                | 3.2 ± 2.0

Qualitative Localization Accuracy: In Fig. 3, we display a random selection of rendered vs. real ureteroscope image pairs from query trainee videos and their estimated localizations. In all frames, the corresponding anatomical landmarks are clearly identifiable.

Visitation Classification Accuracy: In the cross-validation study, an average of 69 out of 74 calyces were correctly classified, corresponding to a classification accuracy of 92.8% (CI: 91.6%-94.0%). The mean visitation threshold is VS^f_thd = 0.45 ± 0.06. In Fig. 4, we display examples of correct visitation outputs. In Fig. 5, we display three failure cases, in each of which the framework misclassified one calyx.

System Runtime: Generating the reference model took around 40 minutes per phantom, given 1500-2000 reference frames (after striding). At stage 2, a query video with 1000 frames (after striding) took around 10 minutes. This promises a semi-real-time evaluation tool for phantom training.

Fig. 3 A random selection of rendered vs. real ureteroscope image pairs. (Note: the kidney stones are not in the CT view.)
Fig. 4 Five example cases where the framework accurately identifies visited/missed calyces.

5 Discussion

5.1 Stage One - Reference Reconstruction Quality

The reference models are geometrically accurate with small outliers, given the small Euclidean and Hausdorff distances. The clinically relevant regions, particularly the calyces, correspond well to the CT segmentation and are well reconstructed. The relatively low numerical reconstruction coverage is mainly due to missing entry-point and posterior renal pelvis regions, which are generally not clinically relevant. Some calyces lack deep/distal portions due to limited parallax from predominantly forward motion, but most of each calyx is reconstructed, supporting query pose localization. In phantom 4, two lower calyces are incompletely reconstructed because their geometry is hard to reach, requiring near-180° ureteroscope bending. This has two implications for localization: first, such regions are often inaccessible to trainees as well, limiting their practical impact on visitation classification; second, they represent edge cases where framework-based classification (and, to a lesser extent, binary human annotation) becomes ambiguous.

Fig. 5 Example cases where the framework misidentified one calyx in each case. A: Phantom 2. B: Phantom 3. C: Phantom 4.

Reconstruction Camera Pose Accuracy: The mean reference pose errors are under 4 mm (Table 2) for all phantoms. This error is adequately low for the geometry, as the calyces have diameters of around 10 mm and depths of over 20 mm.

5.2 Stage Two - Pose Localization Accuracy

Quantitatively, the mean query pose errors are all under 4 mm (Table 2), which is again adequately low for visitation classification, given the anatomical dimensions mentioned above.
Qualitatively, the rendered and real image pairs consistently depict the same anatomical structures in the correct relative positions, demonstrating the framework's precision. There are only minor position or orientation inaccuracies in a few cases. This is acceptable for categorical classification of calyx visitation, which does not require highly accurate geometric localization.

5.3 Visitation Classification Accuracy

Across cross-validation trials, the framework's visitation classifications closely match expert annotations. In 15 trainee videos, the average accuracy is 92.8%, with 69 of 74 calyces correctly classified, indicating that the framework provides robust automated training feedback. The visitation thresholds are relatively stable (VS^f_thd = 0.45 ± 0.06) across folds, suggesting robustness.

However, the failure cases shown in Fig. 5 highlight some limitations of the current system. In Fig. 5-A, a wrong classification happens because the user only briefly glances at a calyx, which the human annotator marks as missed. As the current framework analyzes each frame independently without considering view duration, this leads to a "visited" classification. An estimation of view duration and scope trajectory could resolve this limitation.

In Fig. 5-B, a visited calyx is marked as missed due to the highly erratic motion the user employed to reach the calyx, which renders almost all frames there blurred. The current framework is generally robust for normal-speed videos because it can localize individual high-quality frames that cannot be reconstructed on their own, even if most frames are blurred and of low quality. However, if the entire video sequence exploring a calyx is of low quality, this remains a challenge. A retrieval method based on higher-level geometric or lumen shape features [21] may be explored for its improved robustness over feature-based methods.

Lastly, some cases in phantom 4 (Fig. 5-C) had visited calyces marked as missed. As discussed, these are hard-to-reach edge cases with incomplete reconstructions, reflected by the fact that no trainee fully explored the lower calyces.

5.4 Use Cases of Visitation Classification Results

The primary use case of the proposed framework is to provide automatic, unsupervised feedback to trainees, for example through visual summaries of calyx visitation (Fig. 4). Future human-computer-interaction-focused studies are needed to evaluate the effect of alternative feedback modalities. Beyond direct feedback, visitation classification could serve as a quantitative metric for skill assessment, provided that a meaningful correlation with trainee skill level or guidance type can be established. The focus of this work was to show that the proposed framework can localize challenging query frames and robustly classify calyx visitation. With that in mind, and given the limited number of trainees in the dataset, no correlation was observed between visitation classification outcomes and trainee skill level or guidance type. Future studies with a larger cohort are needed to evaluate these potential relationships more rigorously.

The system can also reveal cohort-level trends by identifying anatomies missed across users. For example, all users missed a small calyx in phantom 3, and two out of four users missed the lower-most calyx in phantom 4. This can highlight difficult-to-access anatomies, informing trainee guidance.

Finally, the framework has potential utility as a preoperative planning tool. As phantoms can be made from patient CTs, surgeons may explore a phantom prior to surgery. The visitation analysis can then help identify calyces that are easy to miss. This may allow surgeons to better plan for challenging anatomies and potentially improve surgical outcomes.
6 Conclusion

In conclusion, we introduce a novel approach for identifying missed calyces with high accuracy from exploration videos of kidney phantoms. The approach allows for automated training feedback on kidney exploration. This can reduce the amount of manual supervision in surgical training and opens up a viable avenue for out-of-OR training. Additionally, it can be a useful tool for surgical planning, providing surgeons with a better understanding of easy-to-miss calyces prior to the real surgery, potentially leading to better surgical outcomes.

Acknowledgments

This study was partially supported by NIBIB of the NIH, Grant 1R21EB035783.

References

[1] Brain, E., Geraghty, R.M., Lovegrove, C.E., Yang, B., Somani, B.K.: Natural history of post-treatment kidney stone fragments: A systematic review and meta-analysis. The Journal of Urology 206(3), 526-538 (2021)

[2] Yamany, T., Batavia, J., Ahn, J., Shapiro, E., Gupta, M.: Ureterorenoscopy for upper tract urothelial carcinoma: how often are we missing lesions? Urology 85(2), 311-315 (2015)

[3] Arora, S., Sevdalis, N., Nestel, D., Woloshynowych, M., Darzi, A., Kneebone, R.: The impact of stress on surgical performance: a systematic review of the literature. Surgery 147(3), 318-330 (2010)

[4] Bai, H., Sasikumar, P., Yang, J., Billinghurst, M.: A user study on mixed reality remote collaboration with eye gaze and hand gesture sharing. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1-13 (2020)

[5] Acar, A., Atoum, J., Connor, P.S., Pierre, C., Lynch, C.N., Kavoussi, N.L., Wu, J.Y.: Navius: Navigated augmented reality visualization for ureteroscopic surgery. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2025, pp. 433-443 (2025)

[6] Oliva Maza, L., Steidle, F., Klodmann, J., Strobl, K., Triebel, R.: An ORB-SLAM3-based approach for surgical navigation in ureteroscopy. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 11(4), 1005-1011 (2023)

[7] Schmidt, A., Mohareri, O., DiMaio, S., Yip, M.C., Salcudean, S.E.: Tracking and mapping in medical computer vision: A review. Medical Image Analysis 94, 103131 (2024)

[8] Widya, A.R., Monno, Y., Imahori, K., Okutomi, M., Suzuki, S., Gotoda, T., Miki, K.: 3D reconstruction of whole stomach from endoscope video using structure-from-motion. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3900-3904 (2019)

[9] Brunckhorst, O., Aydin, A., Abboudi, H., Sahai, A., Khan, M.S., Dasgupta, P., Ahmed, K.: Simulation-based ureteroscopy training: a systematic review. Journal of Surgical Education 72(1), 135-143 (2015)

[10] Matsumoto, E.D., Hamstra, S.J., Radomski, S.B., Cusimano, M.D.: A novel approach to endourological training: training at the surgical skills center. The Journal of Urology 166(4), 1261-1266 (2001)

[11] Brehmer, M., Tolley, D.A.: Validation of a bench model for endoscopic surgery in the upper urinary tract. European Urology 42(2), 175-180 (2002)

[12] White, M.A., DeHaan, A.P., Stephens, D.D., Maes, A.A., Maatman, T.J.: Validation of a high fidelity adult ureteroscopy and renoscopy simulator. The Journal of Urology 183(2), 673-677 (2010)

[13] Zhang, S., Zhao, L., Huang, S., Ma, R., Hu, B., Hao, Q.: 3D reconstruction of deformable colon structures based on preoperative model and deep neural network. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 1875-1881 (2021)

[14] Acar, A., Lu, D., Wu, Y., Oguz, I., Kavoussi, N., Wu, J.Y.: Towards navigation in endoscopic kidney surgery based on preoperative imaging. Healthcare Technology Letters 11(2-3), 67-75 (2024)

[15] Sarlin, P.-E., Cadena, C., Siegwart, R., Dymczyk, M.: From coarse to fine: Robust hierarchical localization at large scale. In: CVPR (2019)

[16] Acar, A., Smith, M., Al-Zogbi, L., Watts, T., Li, F., Li, H., Yilmaz, N., Scheikl, P.M., d'Almeida, J.F., Sharma, S., et al.: From monocular vision to autonomous action: Guiding tumor resection via 3D reconstruction. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 21714-21720 (2025)

[17] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297-5307 (2016)

[18] Zhao, X., Wu, X., Chen, W., Chen, P.C., Xu, Q., Li, Z.: ALIKED: A lighter keypoint and descriptor extraction network via deformable transformation. IEEE Transactions on Instrumentation and Measurement 72, 1-16 (2023)

[19] Schönberger, J.L., Frahm, J.-M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

[20] Atoum, J., Li, F., Acar, A., Kavoussi, N.L., Wu, J.Y.: From sight to skill: A surgeon-centered augmented reality system for ureteroscopy training. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2025, pp. 209-218 (2025)

[21] Tian, Q., Liao, H., Huang, X., Yang, B., Wu, J., Chen, J., Li, L., Liu, H.: BronchoTrack: Airway lumen tracking for branch-level bronchoscopic localization. IEEE Transactions on Medical Imaging 44(3), 1321-1333 (2025)
