Towards Controllable Video Synthesis of Routine and Rare OR Events

Dominik Schneider 1,2,†, Lalithkumar Seenivasan 1,†, Sampath Rapuri 1, Vishalroshan Anil 1, Aiza Maksutova 1, Yiqing Shen 1, Jan Emily Mangulabnan 1, Hao Ding 1, Jose L. Porras 1,3, Masaru Ishii 3, Mathias Unberath 1,*

1 Johns Hopkins University, Baltimore, 21211, MD, USA.
2 Technical University Munich, Munich, 80333, BY, Germany.
3 Johns Hopkins Medical Institutions, Baltimore, 21287, MD, USA.

* Corresponding author. E-mail: unberath@jhu.edu
† Co-first author.

Abstract

Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR.

Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model to first transform OR scenes into abstract geometric representations, then condition the synthesis process, and finally generate realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near misses of sterile-field violations.

Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR on both in- and out-of-domain datasets. Through qualitative results, we illustrate its ability to perform controlled video synthesis of counterfactual events. An AI model trained and validated on the generated synthetic data achieved a recall of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify the performance gains from key design choices.

Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations. Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models.

Keywords: OR Event Generation, Ambient Intelligence, OR Video Generation, Conditional Video Diffusion, Diffusion Models

1 Introduction

With the operating room (OR) being central to patient care [1] and hospital economics [2], driving efforts toward OR ambient intelligence can enhance hospital performance both clinically and financially. Clinically, enabling automated OR workflow analysis and optimization reduces intraoperative risk and improves patient outcomes: automated detection of safety-critical events (e.g., sterile-field breaches) can prevent harm; shorter anesthesia exposure benefits patients [3]; each additional 60 minutes of surgery increases the odds of surgical-site infection by 37% [4]; improved coordination is associated with fewer complications [5]; and shorter case durations improve OR throughput, increasing access for more patients in need of care.
Financially, global surgical demand exceeds existing hospital capacity [6, 7]. With constant demand for surgery, and with ORs generating roughly 60-70% of hospital revenue [2] while accounting for 35-40% of hospital expenses [2], enabling automated OR workflow optimization could increase surgical throughput and thereby hospital revenue. Furthermore, optimizing OR workflow could potentially allow for better resource utilization and operational efficiency [8], reducing hospital costs.

Although emerging AI solutions bring OR ambient intelligence within reach, progress is constrained by the lack of comprehensive datasets that capture the full spectrum of OR events necessary for model development. Curating rare, safety-critical, or atypical OR events at scale is operationally and ethically challenging. In clinical practice, generating such data is practically difficult due to privacy and access constraints, site and procedure variability, and the inherent rarity of critical events. Deliberately eliciting safety-critical rare events – sterile-field violations, equipment handoffs, or deviations from standard protocols – to enrich a dataset is ethically impermissible and risks patient harm. Manual curation and staged reenactments are not scalable, given the breadth of clinical variability, staffing limitations, and operational disruption. There is a clear need for scalable methods to generate OR events on demand – with rich procedural variation and rare, safety-critical scenarios – to support the development of OR ambient intelligence.

Advancing controllable and scalable data curation methods, we introduce an OR video-diffusion framework conditioned on an abstract geometric scene representation to curate synthetic videos of routine and rare OR events. The framework represents the OR workspace and events using simple geometric primitives: ellipsoids for personnel, the patient, and equipment. Given an initial OR scene and abstract geometric representations of the intended OR event as conditioning input, it generates synthetic videos of the specified events. The event conditioning is modeled either from prior routines derived from known OR events or from user-defined trajectories on the abstract geometric representations. Our key contributions are:

• We introduce an abstract, geometry-conditioned OR diffusion framework with a novel geometric abstraction and conditioning module, enabling controlled, scalable synthesis of OR-event videos via ellipsoid-based entity representations and trajectory sketches.
• We demonstrate the video synthesis of routine, rare, atypical, and safety-critical OR events (sterile-field violations) that would otherwise be difficult to collect for practical and ethical reasons.
• We curate a synthetic dataset and train AI models to detect sterile-field violations, achieving 70% recall (sensitivity) and demonstrating the framework's potential to enable scalable data curation for ambient intelligence development.
• Additionally, we augment the baseline fine-tuning with a PatchGAN loss to improve the local realism and fidelity of the synthesized videos.

2 Related Work

Generative models for general-purpose video synthesis have been widely explored, from early GAN-based approaches [9] to more modern diffusion-model frameworks [10]. Among the general-purpose video diffusion models, Stable Video Diffusion (SVD) [11] is a popular image-to-video diffusion model, incorporating an image frame and an optional text prompt as conditioning inputs.
In contrast, the WAN family of models [12] represents a series of text-conditioned diffusion approaches. While these models achieve strong performance, they utilize natural-language prompts and/or single keyframes as conditioning inputs, which lack fine-grained control over object positioning, orientation, and interactions. In the surgical domain, generative models have been adopted for simulation [13], using controllable conditioning inputs to guide the video generation process. Typically, these conditioning signals are class labels, text prompts, reference images or videos, or trajectory information, which are essential for accurately modeling complex and high-risk surgical workflows. To our knowledge, no prior work has achieved controllable generation of the ambient operating room environment, including staff movement, equipment repositioning, and safety-critical events.

3 Method

Our proposed abstract, geometry-conditioned OR diffusion framework reformulates a video-to-video diffusion task into a controlled OR event video generation task conditioned via abstract geometric scene representations. The framework incorporates three main modules: (i) a geometric abstraction module, (ii) a conditioning module, and (iii) a diffusion module (Fig. 1). Given an initial OR scene, the geometric abstraction module first transforms it into an abstract geometric scene representation. The conditioning module generates a temporal series of abstract geometric scene representations, either based on prior routines from known OR events or based on user-generated trajectories on the geometric representation of the initial scene. The diffusion module – built on a video-to-video diffusion backbone – then uses the initial scene and the series of abstract geometric scene representations (video conditioning) as input to diffuse an OR event.

Fig. 1 The abstract, geometry-conditioned OR diffusion framework consists of three main modules: (i) the Geometric Abstraction Module converts the initial OR scene into an abstract geometric scene representation using ellipsoids. (ii) The Conditioning Module generates temporal sequences of abstract geometric scenes through two pathways: from routine OR events (blue dashed path), or from incorporating user-defined trajectories (dash-dotted green path). (iii) The Diffusion Module synthesizes videos of OR events conditioned on the initial scene and the geometric sequences.

Abstract Geometric Scene Representation: The abstract geometric scene representation can be formulated as an implicit scene graph G = (V, E), where V = {v_1, v_2, ..., v_k} represents a set of k nodes corresponding to OR entities (OR personnel, patient, and equipment), and the edges E represent the implicit spatial and temporal relationships between the nodes. Each node v_j consists of geometric attributes g_j ∈ R^6 and class information c_j ∈ R^2. The geometric attributes encode (a) a 2-dimensional centroid position, (b) a 3-dimensional ellipsoid representation capturing spatial spread (height, width, and rotation angle), and (c) a 1-dimensional normalized relative depth value. The 2D class vector c_j ∈ R^2 represents the (R, G) color-channel values for each entity class, adopting the 36 semantic classes defined in the MMOR dataset [16]. Since the blue channel is reserved for depth encoding, the red and green channels encode semantic class information, providing visually distinct representations for each entity type. With implicit edges, spatial relationships such as proximity can be derived from pairwise distances between 2D centroids in normalized image space. Spatial layering ('in-front-of', 'occluded-by') is captured through relative depth values. Temporally, nodes representing the same object across frames form implicit correspondences.
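To make the node formulation concrete, the following is a minimal Python sketch of the per-node attributes g_j ∈ R^6 and c_j ∈ R^2 and an implicit proximity edge; the class and function names are illustrative, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Tuple
import math

@dataclass
class ORNode:
    """One OR entity v_j in the abstract geometric scene representation."""
    centroid: Tuple[float, float]  # (x, y) position in normalized image space
    height: float                  # ellipsoid spatial spread: height
    width: float                   # ellipsoid spatial spread: width
    angle: float                   # ellipsoid rotation angle (degrees)
    depth: float                   # normalized relative depth in [0, 1]
    class_rg: Tuple[int, int]      # (R, G) color values encoding the semantic class

    def as_vectors(self):
        """Return (g_j in R^6, c_j in R^2) as in the formulation above."""
        g = (*self.centroid, self.height, self.width, self.angle, self.depth)
        return g, self.class_rg

def proximity(a: ORNode, b: ORNode) -> float:
    """Implicit spatial edge: pairwise distance between 2D centroids."""
    return math.dist(a.centroid, b.centroid)
```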
Rendering the abstract scene representation: Each scene representation G is rendered as a 2D image at 1024 × 768 resolution (Fig. 2), with nodes depicted as ellipses on a black canvas. Each ellipse is positioned at its centroid and scaled and rotated according to its spatial spread parameters (height, width, and rotation angle). Class information is encoded in the red and green color channels, providing a unique color for each object class, while normalized depth is encoded in the blue channel intensity.

Fig. 2 Geometric abstraction module pipeline: given an initial scene and segmentation point prompts, SAM2 [14] propagates instance segmentation masks across the video. Depth information is estimated using Video Depth Anything [15]. Each segmented instance is then approximated by an ellipsoid parameterized by its centroid position and spatial spread (height, width, rotation angle). The resulting abstract geometric scene representation encodes class information in the red and green channels, combined with normalized relative depth in the blue channel intensity.

(i) Geometric abstraction module: Given an OR scene, the abstract geometric scene representation is created using a semi-automated pipeline that employs out-of-the-box SAM2 [14] and Video Depth Anything [15] models. First, entities in the scene and their geometric parameters (the ellipsoids' centroid position, height, width, and rotation angle) are extracted from segmentation masks generated by SAM2 using manual prompts (at inference) or groundtruth prompts (during training). The 1D normalized relative depth values are extracted from depth maps and averaged over each instance mask. The extracted features are then used to render the abstract scene representation.
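The two steps above – fitting an ellipse to each instance mask and rendering the color-coded canvas – can be sketched as follows. This is a simplified stand-in for the authors' pipeline, and the depth-ordering convention is an assumption.

```python
import cv2
import numpy as np

def mask_to_ellipse(mask: np.ndarray, depth_map: np.ndarray):
    """Fit an ellipse to a binary instance mask and average depth over it."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)   # needs >= 5 contour points
    (cx, cy), (w, h), angle = cv2.fitEllipse(largest)
    depth = float(depth_map[mask > 0].mean())      # normalized relative depth
    return (cx, cy), (w, h), angle, depth

def render_scene(nodes, size=(1024, 768)):
    """Draw ellipses on a black canvas: class colors in R/G, depth in B.

    `nodes` is a list of ((cx, cy), (w, h), angle, depth, (r, g)) tuples.
    Entities are drawn in depth order so nearer ones paint over farther ones
    (assuming larger depth values mean closer; flip the sort otherwise).
    """
    canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for (c, wh, angle, depth, (r, g)) in sorted(nodes, key=lambda n: n[3]):
        color = (int(255 * depth), g, r)  # OpenCV uses BGR channel order
        cv2.ellipse(canvas, (c, wh, angle), color, thickness=-1)
    return canvas
```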
(ii) Conditioning module: With video generation formulated as a video-to-video diffusion task, this module curates a series of abstract scene representations (corresponding to the number of frames in the diffused video) to condition the diffusion. First, it applies the geometric abstraction module to all frames of a template video of a known OR event. The resulting series of abstract representations is then used, together with the initial scene, as conditioning to diffuse the synthetic OR event. Alternatively, the module also offers the flexibility to alter the trajectories of one or more entities using user-defined trajectories, to diffuse synthetic atypical/rare/safety-critical OR events. We implement an interactive trajectory-drawing tool (using OpenCV, Pygame, and Tkinter) that allows users to select ellipses from the abstract geometric representation by clicking on them, then sketch the desired movement paths by drawing freehand trajectories. The tool captures waypoints along the drawn trajectory, which are interpolated across the full video sequence and applied as translational offsets to the selected ellipse's centroid position. Fig. 3 illustrates an example of video conditioning using user-defined trajectories: OR personnel walking around the instrument table instead of moving towards the patient.

Fig. 3 Interactive conditioning module for counterfactual event generation. Given an input OR video sequence, the abstraction module converts the scene into an abstract geometric representation. A graphical user interface enables direct manipulation of these ellipsoids through drag-and-drop operations to sketch desired trajectories. The conditioning module transforms the original geometric sequence into a counterfactual event by incorporating the user-modified trajectories.
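As a rough illustration of the waypoint handling described above, the sketch below interpolates a freehand trajectory into per-frame translational offsets; the helper name and the uniform arc parameterization are assumptions, not the tool's exact behavior.

```python
import numpy as np

def trajectory_offsets(waypoints, num_frames: int = 97) -> np.ndarray:
    """Turn sketched waypoints into per-frame offsets for an ellipse centroid.

    waypoints: list of at least two (x, y) points captured along the drawn path.
    Returns an array of shape (num_frames, 2) holding translations relative to
    the trajectory's starting point, added to the centroid frame by frame.
    """
    pts = np.asarray(waypoints, dtype=float)
    t = np.linspace(0.0, 1.0, len(pts))          # parameter along the sketch
    frames = np.linspace(0.0, 1.0, num_frames)   # one sample per video frame
    xs = np.interp(frames, t, pts[:, 0])
    ys = np.interp(frames, t, pts[:, 1])
    return np.stack([xs - xs[0], ys - ys[0]], axis=1)
```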
(iii) Diffusion module: We employ LTX-Video [17] – a transformer-based latent video diffusion model – as the diffusion backbone. Specifically, we fine-tune it using the In-Context LoRA (IC-LoRA) pipeline, which allows for video-to-video diffusion by conditioning on reference frames (the abstract geometric scene representations). During fine-tuning, in addition to the baseline implementation, we incorporate a PatchGAN loss [18] to further improve the fidelity and realism of the synthesized video. The fine-tuned IC-LoRA weights, trained specifically for in-context conditioning, enable the model to interpret these rendered visualizations as structural guidance during generation.
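For reference, the PatchGAN loss [18] scores local patches rather than whole frames, penalizing locally unrealistic texture. The sketch below shows a minimal PatchGAN discriminator and the adversarial term a generator would add to its objective; the layer widths, unconditional formulation, and BCE objective are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Minimal PatchGAN-style discriminator: outputs a grid of per-patch
    real/fake logits instead of a single scalar per image."""
    def __init__(self, in_ch: int = 3, base: int = 64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(2):  # downsampling blocks, in the spirit of pix2pix [18]
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]  # patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

bce = nn.BCEWithLogitsLoss()

def generator_adv_loss(disc: PatchDiscriminator,
                       fake_frames: torch.Tensor) -> torch.Tensor:
    """Adversarial term added to the baseline fine-tuning loss: the generator
    is rewarded when every patch of its output is classified as real."""
    logits = disc(fake_frames)
    return bce(logits, torch.ones_like(logits))
```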
4 Experiments and Results

4.1 Experimental setup

(i) Dataset: We employ two public datasets: MMOR [16] and 4D-OR [19]. The diffusion model is trained and validated on videos from the MMOR dataset. With the original videos available at 1 fps, we temporally interpolate them to 24 fps using LTX's keyframe interpolation feature [17]. To maintain segmentation consistency across the interpolated videos, we employ SAM2 to segment OR entities based on the first-frame groundtruth annotations. Videos are processed at 1024 × 768 resolution, with 97 frames each. The train/test split was assigned on a video-wise basis: 338 videos for fine-tuning the framework and 50 videos for a detailed ablation study. For the in- and out-of-domain comparison against baseline conditional diffusion models, we used 6 videos from the MMOR test set and 6 videos from 4D-OR.

(ii) Training and inference: The IC-LoRA-adapted video diffusion model is trained on a single NVIDIA A100 GPU for 8000 steps. We adopt the default hyperparameters from LTX's video style-transfer configuration: LoRA rank and alpha of 128, a learning rate of 2 × 10^-4, the AdamW optimizer, and bfloat16 mixed-precision training. Inference is performed using 50 denoising steps with a guidance scale of 3.5. During training, first-frame conditioning is provided in 20% of cases to encourage both conditional and unconditional generation capabilities. At inference time, first-frame conditioning is used exclusively, providing the initial frame of the target video alongside the complete rendered abstract scene representations.

(iii) Evaluation metrics: (a) With groundtruth videos: we use the FVD, SSIM, PSNR, and LPIPS metrics. SSIM and PSNR quantify the average video quality and degradation of the generated videos against the groundtruth videos; FVD summarizes set-level spatio-temporal realism. Additionally, to evaluate structural accuracy and alignment with the abstract conditioning, we use bounding-box IoU (BB IoU) and segmentation-mask IoU (Seg IoU). These metrics quantify controllability by measuring the spatial alignment between the conditioned ellipsoid trajectories and the corresponding entity positions in the generated video. We prompt each instance in the groundtruth initial frame to generate segmentation masks using SAM2 and track them across both the real and generated video sequences. By comparing the resulting segmentation masks and bounding boxes between real and generated sequences, we measure spatial alignment and component localization accuracy. (b) Without groundtruth videos: we employ DOVER [20] and the Inception Score [21] to quantify performance. (c) Downstream near-miss detection task: we prioritize recall (sensitivity) over accuracy, as missing a true violation (a false negative) poses a greater clinical risk than a false alarm (a false positive), which simply prompts staff verification.
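As a concrete reference for the controllability metrics, the following are minimal implementations of segmentation-mask IoU and bounding-box IoU, computed per tracked instance between the real and generated sequences; the function names are ours.

```python
import numpy as np

def seg_iou(mask_real: np.ndarray, mask_gen: np.ndarray) -> float:
    """Segmentation-mask IoU between matched instances (boolean arrays)."""
    inter = np.logical_and(mask_real, mask_gen).sum()
    union = np.logical_or(mask_real, mask_gen).sum()
    return float(inter / union) if union else 0.0

def bb_iou(a, b) -> float:
    """Bounding-box IoU; boxes are (x1, y1, x2, y2) with x2 > x1, y2 > y1."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```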
4.2 Results

(i) Baseline comparison: First, we benchmark our abstract-geometry-conditioned OR diffusion framework on in- and out-of-domain test sets against out-of-the-box baseline models: (i) SVD [11], which performs image-to-video diffusion, and (ii) WAN [12] and LTX-base [17], which condition video generation via text prompts. Quantitatively (Table 1), our framework – fine-tuned on a small train set (338 videos; 97 frames each) – performs well on both in-domain and out-of-domain test sets. Fig. 4 shows the qualitative performance of our framework against the baseline models on out-of-domain videos. These results demonstrate the effectiveness of our abstract geometric conditioning in steering the OR video synthesis at every frame, for each entity.

Table 1 Comparison of our framework against out-of-the-box baseline models on in- and out-of-domain test sets. WAN [12] & LTX-base (LTXb) [17]: text-conditioned generation using VLM descriptions of the groundtruth scene. SVD [11]: image-to-video generation with a low dynamic-motion setting. Ours: our proposed framework.

                        MMOR (In-Domain)                 4DOR (Out-of-Domain)
Method       FVD↓      SSIM↑   PSNR↑   LPIPS↓   FVD↓      SSIM↑   PSNR↑   LPIPS↓
WAN [12]     1190.57   0.78    18.95   0.20     699.78    0.86    21.72   0.13
SVD [11]     5021.19   0.68    17.91   0.24     3790.73   0.74    19.86   0.16
LTXb [17]    2439.33   0.47    12.88   0.58     1135.26   0.46    13.10   0.58
Ours         689.88    0.86    23.21   0.13     265.25    0.90    25.87   0.07

Fig. 4 Qualitative comparison of video synthesis methods on the out-of-domain (4DOR) dataset. Groundtruth: original video frames to reconstruct. WAN [12] & LTX-Base: text-conditioned generation using VLM descriptions of the groundtruth scene. SVD: image-to-video generation with a low dynamic-motion setting. Ours: our proposed video synthesis using abstract geometric representations.

(ii) Synthesizing rare/atypical/safety-critical events: To demonstrate the framework's flexibility in the controlled video synthesis of rare/atypical/safety-critical events, which would otherwise be difficult to generate without straining the workforce or risking patient harm, we assess its performance qualitatively (Fig. 5) and quantitatively (Table 2). Quantitatively, we show that our framework performs better than the out-of-the-box DragNUWA [22] – a baseline model that also conditions video generation through user-defined sketches. We show that, by using an interactive conditioning tool, where ellipsoids (the geometric representations of OR entities) can be manipulated and dragged to generate new events, the framework can diffuse counterfactual OR events. Qualitatively, we observe that, while our framework allows for explicit spatial conditioning of entity trajectories, in some cases it has also implicitly learned the interactions between entities in the training distribution based on spatial proximity. For instance, when an OR staff member is conditioned to move towards the instrument table, the framework diffuses an OR event video in which the person is seen interacting with the table.

Table 2 Quantitative assessment of our framework's ability to curate synthetic rare/atypical/safety-critical OR events.

Method           DOVER↑   Inception Score↑
DragNUWA [22]    0.31     1.04 ± 0.05
Ours             0.52     1.03 ± 0.01

Fig. 5 Controllable synthesis of safety-critical, interaction, and alternate OR events. Each column pair shows a routine OR event (left) with its abstract geometric representation (top), and a counterfactual event (right) generated by providing a trajectory for geometric conditioning. Left pair (safety-critical event): a non-sterile assistant approaches the sterile instrument table. Middle pair (interaction): personnel walking toward and reaching for interaction with the table. Right pair (alternate event): a modified trajectory where personnel walk directly toward the patient bed instead of following the original path around the room.

(iii) Developing an OR ambient intelligence model for detecting near safety-critical events from synthetic data: Considering that sterile-field violations can potentially compromise patient outcomes, we define near misses of sterile-field violations as scenarios in which non-sterile personnel approach the sterile field without making contact, and treat these instances as near safety-critical events. Using our trained framework, we curate a synthetic dataset to train and validate AI models for detecting near misses of sterile-field violations. Using 20 of the 50 MMOR test-set videos, we curated 87 synthetic videos depicting positive and negative samples of near misses of sterile-field violations. Image frames from 68 of these synthetic videos were used to train the model; frames from the remaining videos were used for model validation. The synthetic dataset comprises 678 training frames (252 positive, 426 negative) and 228 validation frames (77 positive, 151 negative). Positive samples represent frames where non-sterile personnel are in close proximity to the sterile field, while negative samples represent OR scenes where non-sterile personnel maintain a safe distance from the sterile field. The near-miss detection model is a per-frame image classifier that operates on individual frames without trajectory history, detecting near misses based on spatial proximity within each frame. Fig. 6 shows positive and negative samples of these frames, generated by conditioning the trajectories of entities using the framework. Table 3 summarizes the performance of classifiers trained and validated on these synthetic samples for detecting near misses of sterile-field violations.

Fig. 6 Left: abstract geometric conditioning. Right: synthesized video frames. Positive (red) and negative (green) training samples for near safety-critical event (near misses of sterile-field violation) detection from counterfactual synthetic data generated using our framework. Two scenes demonstrate near-miss progressions: Scene 1 shows non-sterile personnel approaching then retreating from the instrument table. Scene 2 shows personnel passing close to the instrument table.

Table 3 Detecting near safety-critical events (near misses of sterile-field violation) using models trained on synthetic data.

Method          Accuracy↑   Recall↑
ResNet34 [23]   64.91       50.65
ViT-B/16 [24]   67.54       70.13
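To illustrate the downstream setup, the following is a minimal sketch of fine-tuning a ViT-B/16 as a binary per-frame near-miss classifier on the synthetic frames; the torchvision entry point is standard, but the replacement head, optimizer, and learning rate are our assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Binary per-frame classifier: label 1 = near-miss, 0 = safe distance.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Linear(model.hidden_dim, 2)  # replace the ImageNet head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(frames: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of synthetic frames (B, 3, 224, 224)."""
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Recall is then computed on the validation frames as true positives over all actual positives, matching the paper's emphasis on sensitivity.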
(iv) Ablation study: We perform an extensive ablation study using all 50 MMOR test-set videos to validate the key components of our framework. Table 4 compares three conditioning approaches: direct segmentation maps, ellipse rendering without depth, and our proposed ellipse rendering with depth encoding and adversarial training. While segmentation-map conditioning achieves superior reconstruction metrics (FVD: 347.88), this comes at the cost of controllability, as segmentation masks are not easy to manipulate (e.g., for fine-grained conditioning of the limbs) or move around. Our ellipse-based representation maintains strong absolute performance (SSIM of up to 0.88 and BB IoU of 0.93) while enabling flexibility in conditioning and scene composition. Adding the PatchGAN loss (Lg) further improves performance to an FVD of 487.20 and a segmentation IoU of 0.91.

Table 4 Ablation study of our framework with and without Base (diffusion backbone and baseline fine-tuning), Seg (segmentation-mask-based conditioning), E (ellipsoid-based conditioning), D (depth-encoded ellipsoids), and Lg (PatchGAN loss integrated into fine-tuning).

Base  Seg  E   D   Lg    FVD↓     SSIM↑   PSNR↑   LPIPS↓   BB IoU↑   Seg IoU↑
 ✓    ✓                  347.88   0.88    25.34   0.09     0.96      0.95
 ✓         ✓             518.50   0.86    23.65   0.12     0.93      0.90
 ✓         ✓   ✓         532.05   0.86    23.74   0.12     0.93      0.90
 ✓         ✓   ✓   ✓     487.20   0.88    24.71   0.11     0.93      0.91

5 Discussion and Conclusion

In this work, we introduced an OR video diffusion framework that reformulates a video-to-video diffusion task as an OR event diffusion model conditioned on abstract geometric scene representations. By abstracting the input scene and routine OR events into a visualizable geometric representation, and coupling it with an interactive conditioning module, our framework offers flexible and controlled diffusion of OR events. This unlocks synthetic video generation of routine, atypical, rare, and safety-critical OR events at scale – events that are otherwise difficult or near-impossible to curate due to the strain on the workforce and the risk to patient outcomes. We demonstrate that our framework – fine-tuned on a small public dataset of 338 videos – outperforms out-of-the-box baseline models both quantitatively and qualitatively on small in-domain and out-of-domain test sets. We also showcase our framework's controllability in generating atypical/rare/safety-critical OR events using abstract geometric scene conditioning. Additionally, we show our framework's potential for generating synthetic data towards training AI models to detect near safety-critical events – near misses of sterile-field violations.

With this work serving as groundwork toward scalable data curation for developing OR ambient intelligence for OR workflow analysis, key limitations exist that need to be progressively addressed. (i) Conditioning and controllability trade-off: We selected ellipsoids as geometric primitives to enable intuitive drag-and-drop trajectory control while being robustly extractable from segmentation masks, unlike articulated pose representations, which would require complex interfaces and are prone to failure in cluttered OR scenes. This design choice provides sufficient granularity for spatial conditioning to enforce proximity/movement of OR personnel near OR devices, but is limited in enforcing explicit fine-grained articulation control (e.g., the arm extension of OR personnel when reaching for instruments). In the current framework, the generative model implicitly learns articulation and interaction priors from the training data distribution, guided by spatial-proximity conditioning. (ii) Generalization and robustness: Our framework demonstrated generalizability to an out-of-domain dataset (4D-OR) for routine event synthesis. However, challenges remain in generating synthetic videos for significantly different environments due to variations in sterile attire colors, equipment configurations, and surgery types (e.g., open surgery, emergency trauma) not represented in the MMOR training data. (iii) Clinical validation and downstream utility: While clinical collaborators were consulted throughout the development stages, and with this study being groundwork towards scalable synthetic data generation for OR ambient intelligence, formal domain-expert evaluation (e.g., structured ratings by independent surgeons) is beyond the scope of this work. The downstream ambient AI model – the near critical-event detection model – trained and validated on synthetic data serves as a proof of concept. Comprehensive evaluation on real OR images, and the impact of ambient intelligence on enhancing hospital performance – clinical and financial – remain to be studied.
Future work includes: (i) further improving the video diffusion model's performance. Although integrating a PatchGAN loss during fine-tuning has improved fidelity, the resolution and clarity of moving entities degrade as the video progresses and deviates from the groundtruth event; architectural modifications to enhance temporal fidelity and consistency are a promising direction. (ii) Introducing scalable, intuitive conditioning for explicit fine-grained articulation control (e.g., OR personnel reaching for instruments), extending beyond the current framework's explicit spatial (trajectory) conditioning. (iii) Reducing reliance on manual SAM2 prompts at inference by developing a scalable data-curation pipeline that automates geometric abstraction via zero-shot entity detection, enabling large-scale processing of public OR datasets. (iv) Comprehensive validation, including clinical validation, of the framework's generalizability to real and diverse OR environments.

Declarations

Funding: This work was funded by the National Science Foundation under Grant No. 2239077. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation.

Competing interests: The authors have no competing interests.

Ethics approval: This is a computational study involving no human participants or animals and is based on publicly available datasets. No ethical approval was required.

Informed consent: Not applicable.

Author contributions: The first two authors contributed equally to this work.

References

[1] Saeedian, M., Sepehri, M.M., Jalalimanesh, A., Shadpour, P.: Operating room orchestration by using agent-based simulation. Perioperative Care and Operating Room Management 15, 100074 (2019)

[2] Healey, T., El-Othmani, M.M., Healey, J., Peterson, T.C., Saleh, K.J.: Improving operating room efficiency, part 1: general managerial and preoperative strategies. JBJS Reviews 3(10), 3 (2015)

[3] Phan, K., Kim, J.S., Kim, J.H., Somani, S., Di'Capua, J., Dowdell, J.E., Cho, S.K.: Anesthesia duration as an independent risk factor for early postoperative complications in adults undergoing elective ACDF. Global Spine Journal 7(8), 727–734 (2017)

[4] Cheng, H., Chen, B.P.-H., Soleas, I.M., Ferko, N.C., Cameron, C.G., Hinoul, P.: Prolonged operative duration increases risk of surgical site infections: a systematic review. Surgical Infections 18(6), 722–735 (2017)

[5] Koch, A., Burns, J., Catchpole, K., Weigl, M.: Associations of workflow disruptions in the operating room with surgical outcomes: a systematic review and narrative synthesis. BMJ Quality & Safety 29(12), 1033–1045 (2020)

[6] Meara, J.G., Leather, A.J., Hagander, L., Alkire, B.C., Alonso, N., Ameh, E.A., Bickler, S.W., Conteh, L., Dare, A.J., Davies, J., et al.: Global surgery 2030: evidence and solutions for achieving health, welfare, and economic development. The Lancet 386(9993), 569–624 (2015)

[7] Weiser, T.G., Regenbogen, S.E., Thompson, K.D., Haynes, A.B., Lipsitz, S.R., Berry, W.R., Gawande, A.A.: An estimation of the global volume of surgery: a modelling strategy based on available data. The Lancet 372(9633), 139–144 (2008)

[8] Vladu, A., Ghitea, T.C., Daina, L.G., Țîrț, D.P., Daina, M.D.: Enhancing operating room efficiency: the impact of computational algorithms on surgical scheduling and team dynamics. In: Healthcare, vol. 12, p. 1906 (2024). MDPI
[9] Saito, M., Matsumoto, E., Saito, S.: Temporal generative adversarial nets with singular value clipping. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2830–2839 (2017)

[10] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205 (2023)

[11] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint (2023)

[12] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

[13] Chen, T., Yang, S., Wang, J., Bai, L., Ren, H., Zhou, L.: SurgSora: Object-aware diffusion model for controllable surgical video generation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 521–531 (2025). Springer

[14] Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

[15] Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video Depth Anything: Consistent depth estimation for super-long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 22831–22840 (2025)

[16] Özsoy, E., Pellegrini, C., Czempiel, T., Tristram, F., Yuan, K., Bani-Harouni, D., Eck, U., Busam, B., Keicher, M., Navab, N.: MM-OR: A large multimodal operating room dataset for semantic understanding of high-intensity surgical environments. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19378–19389 (2025)

[17] HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)

[18] Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)

[19] Özsoy, E., Örnek, E.P., Eck, U., Czempiel, T., Tombari, F., Navab, N.: 4D-OR: Semantic scene graphs for OR domain modeling. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 475–485 (2022). Springer

[20] Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., Lin, W.: Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20144–20154 (2023)

[21] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Advances in Neural Information Processing Systems 29 (2016)

[22] Yin, S., Wu, C., Liang, J., Shi, J., Li, H., Ming, G., Duan, N.: DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089 (2023)

[23] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
[24] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
