PhysVid: Physics-Aware Local Conditioning for Generative Video Models


Authors: Saurabh Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz

Saurabh Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz
Eindhoven University of Technology
{s.pathak,e.arani,m.pechenizkiy,b.zonooz}@tue.nl
saurabhpathak.github.io/PhysVid

Figure 1. Videos generated by PhysVid with 1.7 billion parameters, compared to videos generated by Wan-14B [47] on VideoPhy [3] captions ("A car gliding over a road slick with rainwater.", "A wine bottle pours a red blend into a glass.", "A blender spins, mixing squeezed juice within it.", "A waterfall cascades over jagged rocks."). Despite the smaller model size, PhysVid achieves better physical realism in generated videos.

Abstract

Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by ≈33% over baseline video generators, and by up to ≈8% on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.

1. Introduction

Generative video models have seen remarkable improvements in aesthetic realism and video quality in the past few years, exemplified by successful commercial models such as Sora [8] and Genie [9]. However, despite being trained on enormous datasets, they still face difficulties in generating videos that faithfully adhere to the physical laws observed in nature and inherent in the data [30, 38]. This inability points to the challenge, and possibly the existence, of a fundamental ceiling on learning to generate physically accurate videos from data alone, without any explicit mechanism to incorporate the underlying physics. This problem has been acknowledged in the literature, and methods have been proposed to improve the physical accuracy of generated videos by incorporating explicit physics-based constraints or models into the generation process [18, 20, 35, 49, 56, 58, 61, 63, 65, 66]. However, these methods apply physics conditioning at the entire time scale of a video, which limits their ability to capture fine-grained physical phenomena that evolve over shorter time scales. To address this limitation, we aim to discover the physics information that arises at temporally local levels in the data and inject it as an additional sequence-aware conditioning in the generative architecture, distinct from the traditional Text-to-Video (T2V) pathway.

The key inspiration to focus on local temporal segments during the video generation process comes from the observation that global text conditioning may be insufficient to capture the intricate physical interactions that occur over subintervals. Previous approaches have focused on enhancing global prompts with physics-based information [56, 65]. However, doing so does not guarantee that the model will focus on relevant details within the appropriate subinterval of the video being generated.
Although effective for static image generation, recent research has shown that a global cross-attention mechanism that applies the same textual guidance across all frames can be suboptimal for video generation, as the model may struggle to interpret the temporal logic of the prompt, leading to a failure to generate details specific to local time intervals [16, 36, 55, 67]. This limitation is demonstrated in models where temporally constant textual guidance results in nearly static attention maps for action-related words over time, causing the generated video to exhibit static or incoherent motion due to the spatiotemporal misalignment of global conditioning with generated frames [16, 45]. Indeed, the motion of objects, changes in lighting conditions, and interactions between elements in a scene often occur rapidly and can be better described over smaller intervals. By conditioning each temporal segment on the physics principles relevant within it, we can ensure that it adheres to physical laws more closely, resulting in a coherent and realistic overall video. We are also inspired by recent progress in video-based world modeling, where video generation is dynamically controlled with frame-level modulations [14, 17, 21, 25, 27, 28, 60, 67]. We extend this idea of frame-level control to physics-conditioned video generation over short temporal fragments.

The proposed approach, PhysVid, involves the following steps. First, the target video is segmented into smaller temporal fragments. Next, the observable physical phenomena in each segment are analyzed, identifying key physical dimensions such as motion dynamics, shape deformations, and optical effects with the help of a Vision Language Model (VLM). This information is used to annotate each segment with a corresponding physics-aware prompt to directly support its content during generation.
Finally, we train a video generation model with temporally aware cross-attention layers that incorporate the segment-level physics-based prompts alongside the global text prompt. This allows the model to respond to both the global context and local physical phenomena during generation. We validate our approach through extensive experiments on the WISA-80k dataset [49]. In summary, we propose the following key contributions.

• We incorporate additional text conditioning pathways into a T2V generator. In contrast to frame-level action conditioning and global text conditioning, our method acts on groups of frames. Working at the chunk level (we use the term "chunk" to refer to a temporally contiguous set of frames from a video) preserves sufficient temporal information to observe physical laws locally, such as motion, while avoiding locally irrelevant pieces of information from the global text.
• We create a separate text prompt for each chunk using a VLM. During generation, each chunk is supported by its own physics-based text conditioning in addition to the global T2V prompt.
• During inference, we also generate counterfactual prompts for each chunk based on the violation of locally observable physics laws. We use these prompts to guide the video generation away from physically implausible scenarios.

In the following sections, we first describe the background, followed by a description of PhysVid. We then report the results of our experiments and conclude with a discussion section.

2. Background

2.1. Generative Text-to-Video modeling

Several methods to generate videos from text descriptions have been proposed in the literature. Earlier methods that laid the foundation for conditional video generation were based on Generative Adversarial Networks (GANs) [15, 32, 33]. However, these approaches suffered from training instability and poor temporal consistency. Later work shifted towards transformer-based autoregressive T2V generation [26, 46, 52, 53].
These models, characterized by sequential prediction of discrete video tokens, can generate videos faster than their predecessors. However, they are susceptible to rapid accumulation of errors, leading to degradation of temporal coherence over long sequences. In contrast, diffusion-based methods avoid the need for discrete tokenization by operating in continuous spaces. A large body of work has adapted existing Text-to-Image (T2I) architectures for video tasks, bypassing the need for massive text-video datasets [1, 5-7, 19, 31, 43, 48, 50, 54, 62, 68]. This is achieved by modifying the internal mechanisms of T2I models, such as by augmenting them with additional temporal layers, structuring latents, or applying attention techniques that create temporal consistency. With the increased availability of large T2V datasets such as OpenVid [39], WebVid [2], and Panda [13], a growing line of work has focused on spatiotemporal diffusion by directly modeling videos in 4D pixel or latent spaces, thus denoising all frames jointly without relying on image-based spaces or T2I backbones [23, 24, 37, 42, 59]. Beyond diffusion, recent work has begun to explore flow-matching techniques [34] for video generation, requiring far fewer generation steps than diffusion-based models [11, 12, 29, 41, 47]. In this context, we consider a T2V generative model G that iteratively transforms, over T time steps, a 4D Gaussian noise sample x_T ∼ N(0, I) ∈ R^(F×C×H×W) under a conditional text prompt c into a video x_0 that follows the content described in c.

Cross-attention in T2V modeling.
Architecturally, T2V models have progressed from 3D U-Net backbones [24] that denoise spatiotemporal volumes, to Diffusion Transformers (DiTs) [40] that scale more effectively and capture longer-range dependencies with space-time self-attention. Text conditioning in DiTs benefits significantly from the cross-attention mechanism, where the visual features of the noisy latent video act as the 'query', while the embeddings from a text encoder serve as the 'key' and 'value'. This allows the model to dynamically weigh the importance of different parts of the text prompt for different spatiotemporal locations in the generated video. However, in prevalent T2V pipelines, cross-attention operates globally. Every spatiotemporal token in the video latent attends to the same time-agnostic text tokens, so textual guidance is applied across frames without frame-specific conditioning. This approach is inefficient for temporal semantics that change over time, motivating our approach of tying different captions to temporal segments for improved temporal alignment.

2.2. Physics-aware video generation

A recent line of research has focused on explicitly incorporating physical principles into the video generation process. Existing approaches include the use of physics simulators [18, 35], the incorporation of physical constraints into loss functions during training [49, 65, 66] or into guidance mechanisms during inference [20, 63], and modular approaches that incorporate physical awareness into visual generation through multistage generation processes or specialized modules [35, 49, 56, 58, 65]. "Force Prompting" [18] is a technique in which video generation is conditioned on explicit physical forces. This method allows a user to apply localized or global forces, such as pushes or winds, to an initial image.
The model then generates a video sequence in which objects react according to these physical inputs, enabling a form of interactive and physically responsive video synthesis. DiffPhy [65] employs a Large Language Model (LLM) to analyze a text prompt and infer its underlying physical context, such as gravity, collisions, or momentum. The LLM generates an enhanced, physics-aware prompt that provides explicit guidance to the video diffusion model. To ensure that the final output adheres to these principles, a multimodal LLM acts as a supervisor, evaluating the physical correctness of the generated frames and guiding the model's training process. PhyT2V [56] is a training-free method that improves the physical realism of generated videos through multiple rounds of generation and refinement. In this method, an initial T2V prompt is used to generate a video, which is then captioned. An LLM then refines the prompt for the subsequent round based on mismatches between the caption and the current prompt. This approach eventually improves the physical plausibility of the generated video, but requires several rounds of generation. Hao et al. [20] recently proposed a training-free approach that uses an LLM to reason about the governing physical principles corresponding to a T2V prompt and generates a counterfactual prompt that violates them. During inference, both the original and the counterfactual prompts are used in a guidance-based generation mechanism similar to Classifier-Free Guidance (CFG) [22].

Figure 2. Procedure for generating physics-grounded local prompts during data annotation: given the video and the global text caption c_g, a VLM instructed to create physics-aware annotations produces a physics annotation c_1, ..., c_k for each chunk.
WISA [49] decomposes abstract physical principles into textual descriptions, qualitative categories, and quantitative properties, and injects them via a mixture of physical experts and a physical classifier. It also curates a dataset that covers diverse laws in dynamics, thermodynamics, and optics to train and evaluate physics compliance. VideoREPA [66] transfers physics understanding from video foundation models to T2V models using cross-distillation losses, aligning intra-frame spatial and inter-frame temporal relations to improve physical commonsense without relying on physics-specialized datasets.

3. Method

Figure 3. Architecture of PhysVid showing local information pathways with chunk-aware cross-attention. Local context tokens from the physics annotations c_1, ..., c_k are concatenated and fused with the noised video x_t through a local attention block (RoPE-modulated Q/K/V attention with a gate and AdaLN) alongside the global attention block; training uses a flow-matching loss to predict x_{t-1}. Commonly applied procedures such as tokenization, latent encoding, and decoding are implicit and not shown.

The central objective of the proposed approach is to improve the overall quality of observable physical phenomena in the generated videos. To that end, we incorporate additional text conditioning based on local physical phenomena observed within smaller temporal segments of the video. This local conditioning is used in conjunction with global T2V conditioning to enhance the physical realism of the generated videos. Specifically, given a global prompt c_g and a set of k physics-based local prompts C := {c_1, ..., c_k} aligned with c_g, our physics-aware generative video model G generates a video x_0 grounded in both c_g and C, by iterative denoising over T steps.
x_{T−1} = G(x_T, c_g, C, T),   x_T ∼ N(0, I)        (1)

In this section, we first describe the procedure for generating physics-based annotations for local conditioning. Subsequently, we describe our chunk-aware cross-attention mechanism that powers the local conditioning pathway, inserted as additional layers in the base architecture. Lastly, we explain the inference process that involves counterfactual generation to enable guidance.

3.1. Annotation of video chunks

Given a training dataset of videos paired with text captions, we first annotate each video with chunk-level physics-based prompts. To do this, we divide each video in the training dataset into a set of contiguous fixed-duration temporal chunks, with a fixed number of frames comprising each chunk. As shown in Fig. 2, each chunk is then separately analyzed by a VLM to identify the visible elements and physical phenomena contained within that segment. While analyzing each chunk separately in this manner helps to focus on the local information, it risks the generated annotations becoming misaligned with the global caption, or even directly contradicting it in the worst case. We address this by also providing the global T2V prompt to the VLM as part of its instructions when processing each chunk, encouraging it to align its annotations with the visible content without contradicting the global T2V prompt. The VLM generates a structured description of the visible physical phenomena within the chunk. We instruct the VLM to focus on three key categories of physics information: dynamics, shape, and optics. The instructions provided to the VLM encourage it to work step by step in a structured manner. This ensures that the descriptions are physically accurate and relevant to the video chunk. In addition, we apply constrained generation techniques to strictly enforce this structure in the output [51].
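As an illustration, the per-chunk annotation loop can be sketched as follows; this is a minimal sketch, not the paper's pipeline. The `vlm.describe` call and the dictionary it returns are hypothetical placeholders for the constrained, structured VLM output, and the even frame split is an assumption.

```python
# Sketch of the chunk-annotation pipeline (Sec. 3.1). `vlm` is a
# hypothetical wrapper; structure and names are illustrative only.

def split_into_chunks(frames, num_chunks):
    """Divide a list of frames into contiguous, roughly equal chunks."""
    n = len(frames)
    bounds = [round(i * n / num_chunks) for i in range(num_chunks + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(num_chunks)]

def annotate_video(frames, global_caption, vlm, num_chunks=7):
    """Produce one physics-grounded local prompt per chunk."""
    local_prompts = []
    for chunk in split_into_chunks(frames, num_chunks):
        # The VLM sees the chunk AND the global caption, and is instructed
        # to describe dynamics, shape, and optics without contradicting it.
        structured = vlm.describe(chunk, instruction=global_caption)
        prompt = "; ".join(f"{k}: {v}" for k, v in structured.items())
        local_prompts.append(prompt)
    return local_prompts
```

Keeping the global caption in every per-chunk instruction is what prevents a locally plausible annotation from contradicting the overall scene description.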
The structured output from the VLM is parsed to extract the relevant physics information, which is then converted into a concise text prompt corresponding to that video chunk. This prompt is used as local conditioning for that chunk during training. An example of the prompt used to guide the VLM is provided in Appendix 7.1.

Figure 4. Generation of local and counterfactual prompts during inference: an LLM instructed to create physics-aware annotations first derives physics annotations c_1, ..., c_k from the global text caption c_g; a second instruction then turns each into a counterfactual physics annotation c'_1, ..., c'_k.

3.2. Local conditioning with cross-attention

Given the annotated dataset with chunk-level physics-based prompts, we train a model to incorporate this local conditioning. The general architecture of the model is illustrated in Fig. 3 and the procedure is described in Algorithm 1. Specifically, we employ chunk-aware local cross-attention in which Rotary Positional Embeddings (RoPE) [44] are applied to both the vision and text modalities. As in standard self-attention, video query tokens are modulated by RoPE parameterized by the 3D spatiotemporal grid (frame, height, width). In addition, we apply RoPE to the text key projections, with an identical frequency basis for both the video query and text key projections. To track the positional awareness of chunks within the local textual information flow, we define a text grid that includes a chunk axis aligned to the number of video chunks. In this manner, cross-attention logits become explicitly cross-modal position aware: a video token can attend differently to text information from a different chunk than to text assigned to its own chunk.
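The effect of sharing one rotary frequency basis between video queries and local-text keys can be illustrated on the temporal axis alone. In this sketch (a simplification assuming a single axis and tiny vectors, not the paper's implementation), the attention logit depends on the relative offset between a video token's chunk and a text token's assigned chunk:

```python
import math

# Minimal sketch of chunk-aware rotary positions (Sec. 3.2): rotating
# video queries by their chunk index and text keys by their assigned
# chunk index, with the SAME frequency basis, makes the dot-product
# logit a function of the chunk offset.

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / (base ** (i / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def chunk_logit(query, key, video_frame, text_chunk, frames_per_chunk):
    """Attention logit between a video query at `video_frame` and a
    local-text key assigned to `text_chunk` (shared frequency basis)."""
    q_chunk = video_frame // frames_per_chunk
    q = rope_rotate(query, q_chunk)
    k = rope_rotate(key, text_chunk)
    return sum(a * b for a, b in zip(q, k))
```

With identical query and key vectors, the logit is maximal when the video token lies in the chunk the text is assigned to and decays with chunk distance, which is exactly the frame-aligned coupling described above.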
This design contrasts with conventional text-to-video cross-attention, in which only video queries carry video positional encoding while text keys follow one-dimensional textual positions, thereby lacking frame-aligned coupling. These modules are inserted into each transformer block of a pretrained model and trained with flow matching [34]. This chunk-wise design enforces local temporal neighborhoods, while a parallel global cross-attention path preserves long-range conditioning. Overall, this design promotes temporally grounded text-video alignment while remaining compatible with standard T2V architectures.

3.3. Inference with local counterfactual guidance

During inference, only the global T2V captions are available, since the videos must be generated from pure noise. Therefore, to supply the local conditioning pathways in the model, we generate the local captions from the global text prompt alone, as shown in Fig. 4. Here, we instruct an LLM to generate a set of prompts grounded in local physics for an "imagined" video clip using only the information present in the global text. These prompts are required to be temporally coherent. In addition, a local prompt is allowed to contain information that is not mentioned in the global prompt, as long as it does not break alignment with the global prompt or the preceding local prompts. It is important to note that during inference, the annotations corresponding to all chunks in a video are generated together, in contrast to the training data annotation process explained in Sec. 3.1. This is done because of the lack of local visual data during inference and also to reduce overhead. The instruction used for the LLM to generate these prompts is provided in Appendix 7.3.

Counterfactual generation.
In addition to generating a prompt that accurately describes the physical phenomena in each chunk, we also generate a corresponding counterfactual that deliberately violates those phenomena, using a process similar to Hao et al. [20]. To create a counterfactual prompt, an LLM first identifies key visual and physics-relevant elements in the "original" local prompt generated previously. It then generates a counterfactual statement that directly contradicts these physics observations, while still being relevant to the visual elements in the original. The generated counterfactual prompts are used during the inference stage to guide the model away from generating physically inaccurate content. The instruction used to generate the counterfactual prompt is shown in Appendix 7.2. During inference, we use both positive and counterfactual local prompts to guide the generation process. We employ classifier-free guidance [22] at both the global and chunk levels, as follows:

x_{T−1} = (1 + w) · G(x_T, c_g, C, T) − w · G(x_T, c_n, C′, T),        (2)

where c_n is a fixed global negative prompt similar to Wan [47], C′ is the set of counterfactual prompts paired with the set of corresponding physics-based prompts C, and w is the guidance scale. Other terms are as in Eq. (1). This strategy enhances the physical accuracy of the generated videos by reinforcing correct physics while discouraging incorrect representations, effectively steering the model away from generating content that violates physical laws. In the next section, we demonstrate that, equipped with the attributes described in this section, PhysVid is capable of inducing adherence to physical principles in the synthesized video content.

4. Experiments

We begin this section with an explanation of our setup, our data preparation procedure, and the choice of evaluation benchmarks.
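The two-level guidance rule of Eq. (2) is a single weighted extrapolation per denoising step. A minimal sketch follows, with `generator` standing in for G as a hypothetical callable; real predictions are latent video tensors, shown here as flat lists of floats:

```python
# Sketch of the guidance step in Eq. (2): extrapolate away from the
# counterfactual-conditioned prediction. `generator`, prompt arguments,
# and the list representation are illustrative assumptions.

def guided_step(generator, x_t, c_g, C, c_n, C_neg, t, w):
    pos = generator(x_t, c_g, C, t)      # physics-grounded prediction
    neg = generator(x_t, c_n, C_neg, t)  # counterfactual-guided prediction
    return [(1 + w) * p - w * n for p, n in zip(pos, neg)]
```

With w = 0 the update reduces to the purely conditional prediction; larger w pushes the sample further from the locally implausible trajectories described by the counterfactual prompts.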
Subsequently, we present quantitative and qualitative results on the evaluation benchmarks, comparing our method with the baselines and the previous literature. Finally, we present the results of ablation experiments that analyze the impact of our method with and without the aid of counterfactual guidance during generation, and compare it with plain finetuning.

4.1. Setup

Data. We use the WISA [49] dataset, which contains a diverse collection of more than 80 thousand videos related to various physical phenomena observed in the world. Following the configuration in Wan [47], we remove videos less than 5 seconds long and divide the remaining videos into 5-second clips sampled at 832×480 and 16 frames per second, leading to a total of 81 frames per video. This results in approximately 53 thousand video samples. Note that while WISA provides detailed physics annotations for each of its training samples, we do not utilize them in our approach, relying instead on learning this information purely from the training videos. This is because these annotations are at a global level and do not focus on physical phenomena observed within smaller temporal segments of the video. Furthermore, due to the division of long videos into 5-second clips, these global annotations become misaligned with the video clips and may lead to noisy conditioning if used directly. This strategy also has the added benefit of making our approach generalizable to other datasets that do not contain such explicit information about physical phenomena. Subsequently, we generate our own chunk-level physics-based prompts using the method described in Section 3, as well as a global text caption for the entire video. We use VideoLLama3-7B [64] for this task. First, we generate a global caption for the entire 5-second video clip. Thereafter, the video is divided into 7 temporally contiguous chunks of frames of approximately 0.7 seconds each.
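The clip and chunk arithmetic above can be checked directly. The extra first frame in the 81-frame count (16 · 5 + 1) is an assumption based on a common convention in such pipelines, as is the exact chunk-boundary handling:

```python
# Check the data-preparation arithmetic from Sec. 4.1: 5-second clips at
# 16 fps yield 81 frames (16 * 5 + 1; the +1 convention is assumed),
# and 7 contiguous chunks then span roughly 0.7 s each.

FPS, CLIP_SECONDS, NUM_CHUNKS = 16, 5, 7

total_frames = FPS * CLIP_SECONDS + 1          # 81 frames per clip
frames_per_chunk = total_frames / NUM_CHUNKS   # ~11.6 frames per chunk
chunk_duration = frames_per_chunk / FPS        # ~0.72 s, i.e. "~0.7 s"
```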
Next, we generate a segment-level prompt using the video chunk along with the global caption as input to the VLM, processing each chunk separately in this manner. Examples of the generated prompts are provided in Appendix 8.

Model. We introduce chunk-aware cross-attention layers in each transformer block of a pretrained Wan2.1 [47] model with 1.3 billion base parameters. The entire architecture is trained in two stages. First, the base model is frozen, and only the newly added modules are trained for 1000 steps. Once they have stabilized, the base layers are unfrozen, and the entire model is trained for an additional 2000 steps. We use 4 GPUs for this task, leading to an effective batch size of 64 samples per step. To generate annotations during inference, as discussed in Sec. 3, we utilize VideoLLama3-7B as a language model.

4.2. Evaluation benchmarks

We use two recently proposed and widely adopted benchmarks, VideoPhy [3] and VideoPhy2 [4], to evaluate the physical accuracy of the generated videos. VideoPhy consists of 344 manually curated captions in three different categories of physical interactions, namely "solid-solid", "solid-fluid", and "fluid-fluid". Similarly, VideoPhy2 uses a larger test set with 590 captions, providing coverage over a diverse set of real-world physical phenomena. Its data is divided into two main action categories: "object interactions" and "sports and physical activities". Both benchmarks include an additional category comprising manually labeled "hard" examples. In addition, they provide an automatic evaluation model trained on human annotations of the generated videos. As a result, their evaluation scores are correlated with human judgment of physical correctness. We follow the same scoring mechanism as outlined in the respective original works. All results are reported as the mean of the benchmark scores over 5 different evaluation sets generated with different random seeds. We present the results next.

Table 1. Results on VideoPhy and VideoPhy2. We report semantic alignment (SA) and physical commonsense (PC; higher is better). PhysVid (1.7B) achieves the best PC on both benchmarks (≈33% relative gain on VideoPhy and over 8% on VideoPhy2 vs. Wan-14B).

Method     Params (B)   VideoPhy (SA / PC)   VideoPhy2 (SA / PC)
Wan-1.3B   1.3          0.46 / 0.24          0.28 / 0.61
Wan-14B    14           0.52 / 0.24          0.29 / 0.59
PhysVid    1.7          0.43 / 0.32          0.28 / 0.64

Figure 5. VideoPhy Physical Commonsense (PC) score by category.

4.3. Results

4.3.1. Quantitative results

VideoPhy. Table 1 shows the performance of PhysVid on the VideoPhy benchmark compared to two Wan2.1 baselines. With a model size of only 1.7B parameters, PhysVid significantly outperforms both the smaller (1.3B) and the much larger (14B) base model on the physical commonsense metric, by ≈33%. Furthermore, this performance gain is reflected across all subcategories of the benchmark, as shown in Fig. 5, demonstrating the effectiveness of local information in improving the physical awareness of the generative video model.

Figure 6. VideoPhy2 Physical Commonsense (PC) score by category.

VideoPhy2. As shown in Tab. 1, PhysVid performs ≈5% and ≈8% better, respectively, than the corresponding baselines on the physical commonsense score. As Fig. 6 shows, this improvement is consistent across both the "object interactions" and "sports and physical activities" subcategories of the benchmark, as well as on the captions categorized by the benchmark as hard to generate accurately. Compared to VideoPhy, the improvements of our method are significantly less pronounced on VideoPhy2. The reason behind this could be the difference in the score calculation method between the two evaluation approaches.
Specifically, VideoPhy2 uses a categorical rating system, in contrast to VideoPhy, which uses a rating on a continuous scale between 0 and 1 followed by hard thresholding [3, 49].

Comparisons with existing approaches. Table 2 shows the performance of existing methods on the widely adopted VideoPhy benchmark, collating results from previous work and ours. From this table, it can be observed that the physical accuracy of generative video models does not necessarily scale with model size. Furthermore, the performance of a generative video model may vary significantly across multiple evaluations. Although this could be due to differences in underlying settings (e.g., the number of denoising steps), it is important to remember that the VideoPhy auto-evaluator is trained to imitate human judgment on the generated videos, which is subjective in nature and inherently noisy. Nevertheless, existing results show that with just 1.7 billion total parameters, PhysVid remains competitive on the VideoPhy benchmark and is on par with the current state of the art, matching or even surpassing many of the larger models.

Table 2. Quantitative comparisons of our approach with previous works evaluated on the VideoPhy benchmark, including both general and physics-aware generative video methods. Symbols indicate results taken from prior work: WISA [49] (†), VideoREPA [66] ($), Hao et al. [20] (#), PhyT2V [56] (∼), VideoPhy [3] (‡). Our own baselines are marked with ∗. Best model scores are underlined. For physics-aware methods with multiple entries, we report the best PC score, with parentheses showing relative improvement over the corresponding baseline.

Method                 SA      PC
Lavie ‡                0.49    0.28
Lavie $                0.49    0.32
VideoCrafter2 †        0.47    0.36
VideoCrafter2 ‡        0.49    0.35
VideoCrafter2 $        0.50    0.30
VideoCrafter2 ∼        0.24    0.15
OpenSora #             0.38    0.43
OpenSora ‡             0.18    0.24
OpenSora ∼             0.29    0.17
HunYuanVideo †         0.46    0.28
HunYuanVideo $         0.60    0.28
Cosmos-7B †            0.57    0.18
Cosmos-7B #            0.52    0.27
CogVideoX-2B ‡         0.47    0.34
CogVideoX-2B $         0.52    0.26
CogVideoX-2B ∼         0.22    0.13
CogVideoX-5B †         0.60    0.33
CogVideoX-5B ‡         0.63    0.53
CogVideoX-5B $         0.63    0.31
CogVideoX-5B #         0.48    0.39
CogVideoX-5B ∼         0.48    0.26
Wan2.1-1.3B ∗          0.46    0.24
Wan2.1-14B #           0.49    0.35
Wan2.1-14B ∗           0.52    0.24

Physics-aware approaches
PhyT2V [56]            0.59 (+23%)   0.42 (+62%)
PhyT2V †               0.61 (+2%)    0.37 (+12%)
WISA [49]              0.67 (+12%)   0.38 (+15%)
VideoREPA-5B [66]      0.72 (+14%)   0.40 (+29%)
Hao et al. [20]        0.49 (+0%)    0.40 (+14%)
PhysVid-1.7B           0.43 (−7%)    0.32 (+33%)

4.3.2. Qualitative results

Figure 1 shows qualitative results on VideoPhy examples. Videos generated with PhysVid show a visible improvement in the physical fidelity of the content, in contrast to the more than 8× larger Wan-14B model. Additional qualitative results are available in Appendix 9.1.

4.3.3. Ablations

Table 3. Ablations. The highest scores in each metric are highlighted, whereas the lowest scores are underlined.

Method                                 VideoPhy (SA / PC)    VideoPhy2 (SA / PC)
baseline (Wan-1.3B)                    0.4570 / 0.2401       0.2845 / 0.6144
finetuning                             0.4174 / 0.2866       0.2765 / 0.6261
PhysVid w/o counterfactual guidance    0.4355 / 0.2924       0.2791 / 0.6334
PhysVid                                0.4302 / 0.3169       0.2775 / 0.6411

Figure 7. Qualitative ablation example on the VideoPhy prompt "A mixing spoon stirring hot chocolate in a cup.", comparing PhysVid, PhysVid without counterfactual prompting, the baseline, and the finetuned model. Best viewed when zoomed in.

We analyze the effectiveness of applying locally grounded physics-based text conditioning in generative video modeling, in contrast to learning physics information
purely from data and standard text conditioning. To that end, we finetune the 1.3B baseline on the same dataset but without adding any chunk-aware cross-attention layers. Furthermore, to understand the role of counterfactual physics guidance, we also test the performance of our model both with and without counterfactual guidance during inference. When inference is performed without counterfactual prompts, we apply blank local prompts to all local chunks during classifier-free guidance. We evaluate all approaches on both VideoPhy and VideoPhy2. The results in Tab. 3 make it evident that our method provides clear advantages over pure finetuning. In addition, applying negative physics conditioning to the local pathways during guidance-based generation further improves the Physical Commonsense (PC) score. This is also reflected in our qualitative analysis in Fig. 7.

5. Discussion

We have presented PhysVid, a method to improve awareness of physical phenomena in generative video modeling by injecting physics knowledge into text prompts aligned with the local temporal segments of the video being generated. We extract this information from videos in the training data using a pretrained VLM, which is instructed to annotate each chunk of frames with the relevant physics information visible within the chunk. This information is injected into a pretrained generative video model through additional cross-attention blocks that employ RoPE to align each annotation with its corresponding chunk. We evaluate this approach on VideoPhy and VideoPhy2, two widely adopted benchmarks for evaluating physical plausibility in generative T2V models. The results show a clear improvement in the physical fidelity of the generated videos across both benchmarks.
Although we employ a context-specific dataset that comes with labeled physics information, our approach does not rely on those labels; instead, it extracts this information directly from the videos, and can do so at a finer temporal granularity than is available in the dataset. This quality makes our approach applicable to generic datasets. Further discussion of closely related work and the limitations of our method is provided in Appendix 6.

As generative video models move toward genuine world simulators, ensuring that their output adheres to fundamental physical laws becomes paramount for their application in high-stakes domains such as robotics, healthcare, and autonomous systems. By introducing a method for localized, physics-aware conditioning, this work contributes a meaningful step toward that ambitious goal.

Acknowledgments

This work is supported by the EU-funded SYNERGIES project (Grant Agreement No. 101146542). We also gratefully acknowledge the TUE supercomputing team for providing the SPIKE-1 compute infrastructure used to carry out the experiments reported in this paper.

References

[1] Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477, 2023.
[2] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
[3] Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. In The Thirteenth International Conference on Learning Representations, 2025.
[4] Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation. CoRR, abs/2503.06800, 2025.
[5] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
[6] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. CoRR, abs/2311.15127, 2023.
[7] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[8] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
[9] Jake Bruce, Michael D. Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal M. P. Behbahani, Stephanie C. Y. Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott E. Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, and Tim Rocktäschel. Genie: Generative interactive environments. In ICML, 2024.
[10] Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue.
Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7763–7772, 2025.
[11] Yang Cao, Zhao Song, and Chiwun Yang. Video latent flow matching: Optimal polynomial projections for video interpolation and extrapolation. In ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy, 2025.
[12] Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, et al. Goku: Flow based video generative foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23516–23527, 2025.
[13] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024.
[14] Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, and Jiang Bian. Playing with transformer at 30+ fps via next-frame diffusion. arXiv preprint arXiv:2506.01380, 2025.
[15] Kangle Deng, Tianyi Fei, Xin Huang, and Yuxin Peng. Irc-gan: Introspective recurrent convolutional gan for text-to-video generation. In IJCAI, pages 2216–2222, 2019.
[16] Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, and Yuhui Yin. Fancyvideo: Towards dynamic and consistent video generation via cross-frame textual guidance. CoRR, abs/2408.08189, 2024.
[17] Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[18] Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, and Chen Sun. Force prompting: Video generation models can learn and generalize physics-based control signals. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[19] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024.
[20] Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, and Daochang Liu. Enhancing physical plausibility in video generation by reasoning the implausibility. arXiv preprint arXiv:2509.24702, 2025.
[21] Haoran He, Yang Zhang, Liang Lin, Zhongwen Xu, and Ling Pan. Pre-trained video generative models as world simulators. In ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025.
[22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[23] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[24] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
[25] Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, and Seungryong Kim. Large language models are frame-level directors for zero-shot text-to-video generation.
In First Workshop on Controllable Video Generation @ICML24, 2024.
[26] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In The Eleventh International Conference on Learning Representations, 2023.
[27] Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025.
[28] Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, and Sung Ju Hwang. Frame guidance: Training-free guidance for frame-level control in video diffusion models. arXiv preprint arXiv:2506.07177, 2025.
[29] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In The Thirteenth International Conference on Learning Representations, 2025.
[30] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective, 2025.
[31] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15908–15918, 2023.
[32] Doyeon Kim, Donggyu Joo, and Junmo Kim. Tivgan: Text to image to video generation with step-by-step evolutionary generator. IEEE Access, 8:153113–153122, 2020.
[33] Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[34] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le.
Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
[35] Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. In European Conference on Computer Vision (ECCV), 2024.
[36] Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, and Yang You. Enhance-a-video: Better generated video for free. CoRR, abs/2502.07508, 2025.
[37] Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7038–7048, 2024.
[38] Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models learn physical principles from watching videos? arXiv e-prints, 2025.
[39] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In The Thirteenth International Conference on Learning Representations, 2025.
[40] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[41] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam S. Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali K. Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dmitry Vengertsev, Edgar Schönfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du. Movie gen: A cast of media foundation models. CoRR, abs/2410.13720, 2024.
[42] Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, et al. xgen-videosyn-1: High-fidelity text-to-video synthesis with compressed representations. In European Conference on Computer Vision, pages 249–265. Springer, 2024.
[43] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data.
In The Eleventh International Conference on Learning Representations, 2023.
[44] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[45] Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, et al. Videotetris: Towards compositional text-to-video generation. Advances in Neural Information Processing Systems, 37:29489–29513, 2024.
[46] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations, 2023.
[47] Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Xiaofeng Meng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, and Ziyu Liu. Wan: Open and advanced large-scale video generative models. CoRR, abs/2503.20314, 2025.
[48] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint, 2023.
[49] Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Jiasong Feng, Zhanjie Zhang, Wanyuan Pang, and Xiaodan Liang. WISA: World simulator assistant for physics-aware text-to-video generation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[50] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 133(5):3059–3078, 2025.
[51] Brandon T Willard and Rémi Louf. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702, 2023.
[52] Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021.
[53] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training for neural visual world creation. In European Conference on Computer Vision, pages 720–736, 2022.
[54] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
[55] Tian Xia, Xuweiyi Chen, and Sihan Xu. Unictrl: Improving the spatiotemporal consistency of text-to-video diffusion models via training-free unified attention control. Transactions on Machine Learning Research, 2024.
[56] Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. CoRR, abs/2412.00596, 2024.
[57] Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, and Huan Yang. Long video diffusion generation with segmented cross-attention and content-rich video data curation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3184–3194, 2025.
[58] Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, et al. Vlipp: Towards physically plausible video generation with vision and language informed physical prior. arXiv preprint arXiv:2503.23368, 2025.
[59] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025.
[60] Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. CoRR, abs/2501.08325, 2025.
[61] Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, and Jiebo Luo. Magictime: Time-lapse video generation models as metamorphic simulators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(9):7340–7351, 2025.
[62] Xin Yuan, Jinoo Baek, Keyang Xu, Omer Tov, and Hongliang Fei. Inflation with diffusion: Efficient temporal adaptation for text-to-video super-resolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 489–496, 2024.
[63] Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. PhysDiff: Physics-guided human motion diffusion model. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15964–15975, Los Alamitos, CA, USA, 2023.
IEEE Computer Society.
[64] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. Videollama 3: Frontier multimodal foundation models for image and video understanding. CoRR, abs/2501.13106, 2025.
[65] Ke Zhang, Cihan Xiao, Yiqun Mei, Jiacong Xu, and Vishal M. Patel. Think before you diffuse: Llms-guided physics-aware video generation, 2025.
[66] Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. VideoREPA: Learning physics for video generation through relational alignment with foundation models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[67] Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, and Nan Duan. Frame-level captions for long video generation with complex multi scenes. arXiv preprint arXiv:2505.20827, 2025.
[68] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.

PhysVid: Physics Aware Local Conditioning for Generative Video Models
Supplementary Material

6. Additional Information

We begin with preliminaries, providing a brief overview of RoPE followed by a procedural description of the chunk-aware cross-attention mechanism to augment the discussion in Sec. 3. Subsequently, we supplement Sec. 5 with a discussion of works closely related to PhysVid, followed by a discussion of its limitations and future opportunities for contribution.

6.1. Rotary Positional Embeddings (RoPE)

RoPE is an established method for encoding positional information within transformer-based models that uniquely captures both absolute and relative positional data through vector rotations [44].
Fundamentally, the goal is to apply a set of block-diagonal rotation matrices $R$ to the query vectors $q$ and the key vectors $k$ at each position. Thus, the transformations for the $d$-dimensional query vector $q_m$ and the key vector $k_n$ at positions $m$ and $n$, respectively, are

$$ q'_m = R_m q_m, \qquad k'_n = R_n k_n \tag{3} $$

where $R_m$ and $R_n$ are the corresponding block-diagonal rotation matrices consisting of $d/2$ blocks. Each diagonal block $R_{m,i}$ of $R_m$ acts on dimensions $2i-1, 2i$ and is defined as

$$ R_{m,i} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \tag{4} $$

where $\theta_i$ is a predefined frequency term. Since $R_m^{\top} R_n = R_{n-m}$, this design ensures that the inner product of the rotated query and key vectors, $(q'_m)^{\top} k'_n$, depends only on the original query and key vectors and their relative distance $n - m$.

6.2. Chunk-aware cross-attention procedure

In PhysVid, the chunk-wise positional alignment between visual query tokens and textual key tokens within the local pathway is achieved through a coherent grid-based encoding scheme. The procedure is described in Algorithm 1. Specifically, the text tokens from all video segments are concatenated and introduced into the local attention pathways, as illustrated in Fig. 3. To preserve and utilize the contextual position of each token, a two-dimensional coordinate grid is imposed on the set of text key tokens. Within this grid, the first dimension indexes the corresponding video chunk, while the second dimension identifies the intra-chunk position. RoPE is then applied to encode these 2D coordinates in the representation of each key token. This design ensures that global and local positional information is preserved for each token throughout the network, so the subsequent cross-attention mechanism can attend to localized content across all chunks while maintaining precise chunk-specific referencing and temporal awareness.
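The rotation of Eqs. (3)-(4) and the 2D (chunk, intra-chunk) grid described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the function names, the even split of the feature dimension between the two text-grid axes, and the single-head, unbatched shapes without learned projections are all simplifying assumptions of this sketch.

```python
import numpy as np

def rope_rotate(x, pos, theta_base=10000.0):
    """Apply Eqs. (3)-(4): rotate each pair (x_{2i-1}, x_{2i}) by angle pos * theta_i."""
    d = x.shape[-1]
    theta = theta_base ** (-2.0 * np.arange(d // 2) / d)  # predefined frequencies theta_i
    ang = pos[..., None] * theta                           # (..., d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def chunk_aware_cross_attention(video_q, chunk_texts, frame_pos):
    """video_q: (Lv, d) video queries; chunk_texts: list of Nb arrays (Lc, d) of
    local text features, one per chunk; frame_pos: (Lv,) frame index per video token."""
    d = video_q.shape[-1]
    K = np.concatenate(chunk_texts, axis=0)   # concatenated local text, length Nb*Lc
    V = K.copy()                              # learned K/V projections omitted in this sketch
    Nb, Lc = len(chunk_texts), chunk_texts[0].shape[0]
    chunk_idx = np.repeat(np.arange(Nb), Lc)  # first grid dimension: chunk index
    intra_idx = np.tile(np.arange(Lc), Nb)    # second grid dimension: intra-chunk position
    half = d // 2                             # assumption: feature dim split between axes
    Kr = np.concatenate([rope_rotate(K[:, :half], chunk_idx),
                         rope_rotate(K[:, half:], intra_idx)], axis=-1)
    Qr = np.concatenate([rope_rotate(video_q[:, :half], frame_pos),
                         video_q[:, half:]], axis=-1)  # temporal axis on video queries
    att = Qr @ Kr.T / np.sqrt(d)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)    # softmax over all local text tokens
    return att @ V                            # video features updated with local text
```

Because each $R_m$ is orthogonal, `rope_rotate` preserves vector norms, and the query-key inner products depend only on relative offsets along each encoded axis, which is exactly the property that lets every video token reference a specific chunk's annotation.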
Algorithm 1: Chunk-Aware Cross-Attention

Require: Video tokens X ∈ R^{B×Lv×H×d}  ▷ Lv: video sequence length
Require: Local text representations {T^(b)}_{b=1}^{Nb} for Nb video chunks
Require: Video grid G_v ∈ N^{B×3}, RoPE frequencies Ω
Require: Number of chunks Nb, per-chunk text length Lc
Ensure: Updated video representation X̂ after chunk-aware cross-attention

// 1. Concatenate local text across all chunks
1: T ← Concat(T^(1), T^(2), ..., T^(Nb))  ▷ single sequence of length Lt = Nb · Lc
// 2. Build 2D grid over local text tokens; initialize G_t ∈ N^{B×3} as follows:
2: G_t[:, 0] ← Nb  ▷ first grid dimension = chunk index
3: G_t[:, 1] ← Lc  ▷ second grid dimension = intra-chunk position
4: G_t[:, 2] ← 1   ▷ dummy spatial axis to share the ApplyRoPE API for both video and text tokens
// 3. Compute query, key, and value representations
5: Q ← ProjectAndNormalizeVideo(X, W_q)  ▷ video queries
6: K ← ProjectAndNormalizeText(T, W_k)   ▷ local text keys
7: V ← ProjectText(T, W_v)               ▷ local text values
// 4. Apply RoPE using video and text grids
8: Q̃ ← ApplyRoPE(Q, G_v, Ω)  ▷ encode video tokens with (frame, height, width) positions
9: K̃ ← ApplyRoPE(K, G_t, Ω)  ▷ encode text tokens with (chunk, intra-chunk) positions
// 5. Multi-head cross-attention over all concatenated chunks
10: X̂ ← MultiHeadAttention(Q̃, K̃, V)  ▷ attend from each video token to all local text tokens across chunks
11: return X̂  ▷ video features updated with chunk-aware local text information

6.3. Related work

The proposed work is in line with recent studies that address the limitations of cross-attention mechanisms within generative Text-to-Video (T2V) frameworks based on Diffusion Transformers (DiTs). A closely related concept is the "Segmented Cross-Attention" introduced in Presto [57], where a prompt is divided into sub-captions using an LLM, each aligned to a specific temporal segment of the video.
This method is a parameter-free mechanism for generating long-range videos that follow a sequence of narrative instructions derived from the main caption. Similarly, DiTCtrl [10] is a training-free method that enables multi-prompt video generation by controlling attention to create smooth transitions between different textual conditions over time. These methods aim to improve narrative coherence using explicit sub-prompts that are generated purely from text or are explicitly provided, while still relying on modulation of the global attention pathway. Although PhysVid also aligns textual information with local temporal segments, its objective and mechanism are distinct. In contrast to these methods, our method does not redesign or modulate the core attention module. Instead, we introduce new, separate cross-attention blocks as a modular addition to a pretrained model, specifically to integrate the chunk-wise generated physics prompts, thereby complementing the global prompt without affecting its attention pathways.

6.4. Limitations and future scope

Although the video-understanding capabilities of VLMs have improved significantly in recent years, they remain prone to hallucination and can produce information that is completely incorrect or misaligned with the presented visual content. This fundamental challenge currently limits their ability to reliably extract physics information from longer videos or videos with complex spatiotemporal physical content. Furthermore, annotating larger datasets with a VLM requires an additional compute budget. Another challenge is scalability to larger models, since the model size can increase quickly due to the additional layers in each transformer block. Therefore, the observed improvement in the physical awareness of the resulting model comes at the cost of slower training and inference relative to the corresponding baseline.
However, this challenge can be mitigated to some extent with advanced techniques for faster sampling, such as model distillation. A more theoretical limitation is the classic train-test distribution mismatch: during inference, the VLM has no visual input when generating local annotations and must rely on the global text alone. However, the video generator always sees the same interface, namely a sequence of local physics-aware text prompts. The mismatch therefore lies only in the upstream prompt generation, bounded by the consistency with which the VLM maps global descriptions to local physics statements with and without visual input. Our experiments on two benchmarks indicate that this does not prevent robust gains in the Physical Commonsense (PC) score. Finally, as described in Sec. 3, while including the global T2V prompt in the instruction to the VLM during the annotation of a video chunk helps generate annotations aligned with the global prompt, it does not explicitly prevent semantic misalignment of annotations across different video chunks. Future work could explore measures to reduce the computational cost of the additional local pathways and to improve alignment of the locally extracted physics information across all chunks.

7. VLM Instructions

In this section, we provide details on the instructions given to the VLM for different use cases.

7.1. Physics-grounded video chunk annotation

Figure 8 shows the VLM input instruction used to generate physics-grounded annotations for video chunks prior to training. During annotation, the global T2V caption is appended to the VLM instruction along with a contiguous chunk of frames from the input video, as discussed in Sec. 3.

7.2. Counterfactual annotation

Figure 9 shows the VLM input instruction used to generate the counterfactual prompt based on incorrect physics.
The counterfactual annotation in our method relies only on the generated "positive" local prompt for a given chunk and does not use any other information. As discussed in Sec. 3, this helps prevent the generation of physically correct descriptions, which are undesirable in this phase.

7.3. Physics-grounded local prompt generation during inference

During inference, the visual data is not available, yet the local physics-based instructions must still be provided as input to the model. To this end, we use the VLM instruction shown in Fig. 10. This instruction relies only on the information contained in the global T2V caption to generate a coherent set of physically correct local prompts.

8. Annotation Examples

In Fig. 11, we visualize the global and local annotations generated by the VLM for an example in the training dataset, together with representative frames for each chunk. Similarly, for inference, Fig. 12 provides the local annotations generated by the VLM for an example caption, along with representative frames from the video chunks generated using these annotations.

You will be provided a short video clip taken from a longer video. In addition, you will also receive a caption as input. The caption describes the overall event or scene happening in the longer video and may contain information that is not visible in the short clip. The duration of the clip is less than one second. Your task is to provide a structured description of the physical phenomena grounded in the clip, focusing only on VISIBLE elements in the clip and not on any elements that are not visible. Your description will be used to recreate the short video clip in a physically accurate manner by a downstream video generator; therefore, it needs to be physically accurate and consistent with the visible elements in the clip.
The information contained within the description should only be enough to describe the physical phenomena contained within that small time segment (less than a second). The description should not contain contradictory statements about events observed in the clip. Perform the following reasoning steps:
1. Understand the given video with the help of the accompanying caption, focusing on the events happening in sequential order.
2. Analyze the visible elements in the video, including objects, people, animals, and environmental features.
3. Analyze relevant physics observations related to VISIBLE elements and how they OBEY physical laws, considering the following domains:
- a. Dynamics (motion, forces, energy, momentum): understand what is moving, how it is moving and why it is moving
- b. Shape (deformation, elasticity): understand the shapes of visible objects and if they are deforming or maintaining their shape
- c. Optics (illumination, ambience, reflections, refractions, shadows): understand the lighting conditions, reflections, and shadows
4. Based on step 3, think about how these physics principles could be structured as a prompt to a video generator so that it can recreate the video.
5. Structure your response as a JSON string according to the example below. Only include observations that are clearly visible in the video.
You will be REWARDED for generating statements that are GROUNDED in physics and VERIFIABLE from the video, and PENALIZED for generating statements that are incorrect in physics, not verifiable from the video, not relevant to any physical phenomena, or copying statements from prompt. Maximize rewards and minimize penalties.
Follow comments in the example below to guide your reasoning:
{
"visible_elements": ["sports car", "wheels", "road", "trees", "sunlight", "shadows", "reflections"],
"thinking": "", // Think about what are the most important physical properties that would help a downstream video generator recreate the exact same video. Cannot be blank.
"physics": "The car's speed is consistent with its motion. The road texture moves backward as the car moves forward. The rotation speed of the wheels matches the car's speed on the ground. The car's shape remains consistent as it moves. The wheels maintain their circular shape while rotating. The lighting is consistent with a sunny day. Trees cast shadows on the ground according to the position of the sun. Reflections on the car's surface change as it moves." // Explain briefly how the video obeys physics laws. Cannot be blank.
}
Now, let's analyze the following video clip along with its caption given below. Proceed step-by-step as instructed above.
Video caption:

Figure 8. VLM instruction to generate the physics caption for a video chunk

9. Additional Results

9.1. Qualitative examples

To supplement the examples in Fig. 1, we visualize additional results generated by our method in Fig. 13, with comparisons to Wan-14B. In Fig. 14, we also provide additional qualitative results from the ablation study discussed in Sec. 4.3.3, to supplement Fig. 7. These examples are also available as videos, along with additional video examples, on our project website.

9.2. Similarity metrics

We evaluated PhysVid on four similarity metrics, as shown in Tab. 4. As can be observed, the results remain consistent, showing a slightly increased FVD score relative to the finetuned baseline, which corroborates the previously noted minor compromise in content fidelity (see Tab. 1) in exchange for substantial improvements in physical realism.
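As a simplified illustration of two of the similarity metrics reported below, the following sketch computes PSNR and a global (single-window) SSIM between two frames using NumPy. This is a minimal sketch for intuition only: the actual evaluation presumably uses standard library implementations (e.g. windowed SSIM and a learned LPIPS network), and the frame names here are placeholders.

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two frames with values in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Global SSIM over the whole frame; library versions use sliding windows."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

# Toy stand-ins for a generated/reference frame pair at 832x480 resolution
rng = np.random.default_rng(0)
ref = rng.random((480, 832))
noisy = np.clip(ref + 0.1 * rng.standard_normal(ref.shape), 0.0, 1.0)
print(f"PSNR: {psnr(ref, noisy):.2f} dB, SSIM: {ssim_global(ref, noisy):.3f}")
```

Higher PSNR/SSIM indicate closer similarity to the reference; identical frames give SSIM of 1.0, while PSNR diverges as the mean squared error goes to zero.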
Model            LPIPS ↓  FVD ↓    SSIM ↑  PSNR ↑
Wan 1.3B         0.703    417.352  0.217   8.625
Wan (finetuned)  0.671    302.465  0.239   9.379
PhysVid          0.679    318.087  0.240   9.234

Table 4. Additional metrics based on similarity (2048 sample pairs)

10. Other details

Configuration. Table 5 lists hyperparameter configurations and other relevant settings for training and inference.

VLM overhead. The average VLM overhead for all prompts generated per sample is ≈ 17.22 ± 0.99 seconds. Including the denoising loop runtime quantified in Tab. 5, the PhysVid approach overall takes 110 seconds per video, roughly a third (0.35×) of the 310-second per-video latency of Wan-14B. All inference run times in our work are reported with bfloat16 precision on a single B200 GPU with a batch size of 1.

You will be provided a physics-rich description of a scene. Perform the following reasoning steps:
1. Identify key elements, objects and environmental features identifiable from the scene description.
2. Each statement in the input belongs to one of three categories of statements: dynamics, shape, and optics. Look at each input statement one-by-one and identify which category it belongs to.
3. As a physicist, predict what would happen instead if physics laws were NOT obeyed within that category.
4. Based on step 3, think about a video that would result from VIOLATION of physics laws in each category and generate a description for that video. Ensure that your description clearly violates the physics laws within the identified category.
5. Structure your response.
You will be REWARDED for generating statements that are unrealistic in physics, and PENALIZED for generating statements that are correct in physics, not relevant to any physics phenomena, or copying input. Maximize rewards and minimize penalties.
Example Input: The car's motion is smooth. The road texture moves backward as the car moves forward.
The rotation speed of the wheels matches the car's speed. The car's shape remains consistent as it moves. The wheels maintain their circular shape while rotating. The lighting is consistent with a sunny day. Trees cast shadows on the ground according to the position of the sun. Reflections on the car's surface change as it moves.
Example output:
{
"visible_elements": ["car", "wheels", "road", "trees", "sunlight", "shadows", "reflections"],
"thinking": "To describe a video that violates physics laws, I need to focus on unrealistic motion dynamics, shape deformations, and incorrect optical effects related to the car, its wheels, the road, trees, sunlight, shadows, and reflections.", // Explain briefly how a video that does NOT follow physics laws would look like. Cannot be blank.
"physics": "The car's speed varies unrealistically. The road texture moves forward as the car moves forward. The rotation speed of the wheels does not match the car's speed. The car's shape changes significantly as it moves. The wheels lose their circular shape while rotating. The lighting is inconsistent with a sunny day. Trees do not cast shadows on the ground according to the position of the sun. Reflections on the car's surface remain static as it moves."
}
Explanation of how the output was generated: what would happen if physics laws were NOT obeyed within each category:
- Dynamics: The car's speed could vary unrealistically. The road texture could move forward as the car moves forward. The rotation speed of the wheels may not match the car's speed. The car could be flying or hovering above the ground.
- Shape: The car's shape could change as it moves. The wheels might not be circular in shape while rotating.
- Optics: The lighting would be inconsistent with a sunny day.
Trees will not cast shadows on the ground according to the position of the sun, or there may be irregularly shaped shadows not matching the object's shape. Reflections on the car's surface may remain static as it moves.
Now, let's consider the input given below STEP-BY-STEP.
Input:

Figure 9. VLM instruction to generate the counterfactual physics caption

You will be provided a caption describing an event or a scene. Your task is to provide a set of SEVEN captions describing the physical phenomena grounded in the original caption. The set of captions would be used to generate a video that is physically accurate and grounded in the original caption. Each caption would be used sequentially by the downstream video generator to generate a short chunk of video. Each caption therefore needs to be physically accurate and consistent with the original caption. Each caption should describe a small time segment of the overall event or scene. The set of captions should together cover the entire event or scene described in the original caption, leading to a coherent video when stitched together. The information contained within each caption should only be enough to describe the physical phenomena contained within the corresponding small time segment (less than a second). A caption should not describe an event that is not possible within the small time segment, but it can build on events from previous segments. Do not describe any elements that are not visible in the scene or are impossible to visualize. Do not output any statements that directly contradict the input. Perform the following reasoning steps:
1. Imagine a short, few-seconds scene based on the caption. Identify key elements, objects and environmental features identifiable from the caption that may be visible in the scene. Do not include any elements that are not visible.
2.
Analyze relevant physics observations related to these elements and how they OBEY physical laws, considering the following domains:
- a. Dynamics (motion, forces, energy, momentum): understand what is moving, how it is moving and why it is moving
- b. Shape (deformation, elasticity): understand the shapes of objects as mentioned in the caption and if they are deforming or maintaining their shape
- c. Optics (illumination, ambience, reflections, refractions, shadows): understand the lighting conditions, reflections, and shadows
3. Based on step 2, think about how these physics principles could be structured as a set of seven temporally correlated prompts in sequence to a video generator so that it can recreate the scene described in the original input caption.
4. Structure your response as a JSON string according to the example below. Output only those statements that are relevant to the input.
You will be REWARDED for generating statements that are GROUNDED in physics and the original input, and PENALIZED for generating statements that are incorrect in physics, completely unrelated to the input, not relevant to any physical phenomena, or copying statements from input. Maximize rewards and minimize penalties.
Example Input: A car moving along the road on a sunny day.
Example Output (follow the comments in the code for instructions on how to generate outputs):
{
"thinking": "To describe a video that follows physics laws, I need to focus on realistic motion dynamics, shape consistency, and accurate optical effects related to the car, its wheels, the road, trees, sunlight, shadows, and reflections.", // Think about what are the most important physical properties that would help a downstream video generator produce the exact same video. Cannot be blank.
"visible_elements": ["sports car", "wheels", "road", "trees", "sunlight", "shadows", "reflections"],
"physics": [
"The car accelerates smoothly along the road, with its speed consistent with its motion.",
"The road texture moves backward relative to the car's forward motion, creating a realistic sense of movement.",
"The wheels rotate at a speed that matches the car's forward velocity, ensuring proper traction and motion dynamics.",
"The car maintains its shape as it moves, with no visible deformations or alterations.",
"The wheels retain their circular shape while rotating, demonstrating structural integrity.",
"The lighting conditions reflect a sunny day, with consistent brightness and color throughout the scene.",
"Trees cast accurate shadows on the ground based on the sun's position, enhancing the realism of the environment."
] // Describe SEVEN different temporally correlated captions that help recreate the imaginary video that obeys physical laws. Each caption should describe a small time segment of the overall event or scene described in the original caption, leading to a coherent video when stitched together. Each caption should be physically accurate and consistent with the original input and the preceding captions.
}
Now, let's analyze the following caption. Proceed step-by-step as instructed above.
Caption:

Figure 10. VLM instruction to generate the local physics captions for all the chunks to be generated at once during inference

“The cars are moving forward due to their engines. The streetlights and buildings do not move because they are fixed structures. The Christmas lights and ornaments do not move because they are stationary decorations. The snow is falling due to gravity. The night sky is dark because it is nighttime. The street is wet due to the snowfall.
The cars' headlights and taillights illuminate the street by reflecting light off the wet surface.”

“The cars are moving forward due to their engines propelling them. The streetlights and buildings are stationary due to their fixed position. The Christmas lights and ornaments are stationary due to being attached to the streetlights and buildings. The snow is falling due to gravity. The night sky is dark due to the absence of sunlight. The street is wet due to the snowfall. The cars' headlights illuminate the road ahead. The streetlights provide ambient light. The buildings are lit up by the car”

“The cars are moving forward due to their engines propelling them. The streetlights and buildings are stationary because they are fixed structures. The Christmas lights and ornaments are stationary because they are attached to the streetlights and buildings. The snow is falling due to gravity. The street is wet due to the accumulation of snow on the ground. The sky is dark blue due to the absence of sunlight.”

“A snowy street at night with Christmas decorations and cars driving by. The streetlights and buildings are decorated with Christmas lights and ornaments.”

“The cars are moving forward due to their momentum. The streetlights and buildings are stationary due to their fixed position. The Christmas lights and ornaments are stationary due to their fixed position. The snow is falling down due to gravity. The night is dark due to the absence of sunlight.”

“The cars are moving forward along the street, which is covered in snow. The streetlights and buildings are stationary, but the lights are illuminating the scene. The snow on the ground is not in motion, but it is covering the street and sidewalks.
The Christmas lights and ornaments are also stationary, but they are emitting light and creating reflections on the snow.”

“The cars are moving forward on the street. The streetlights and buildings are stationary. The Christmas lights and ornaments are stationary. The snow is falling downward. The street is wet due to the snowfall. The cars' headlights are illuminating the street. The streetlights are providing light to the street. The buildings are decorated with Christmas lights and ornaments.”

“The cars are moving forward on the street. The streetlights and buildings are stationary. The Christmas lights and ornaments are hanging from the streetlights and buildings. The snow is falling down from the sky.”

Figure 11. A sample from the data with annotations generated by a Vision Language Model (VLM). The topmost text is the global prompt and the local annotations are listed sequentially alongside a representative frame from each chunk.

“A car driving on a snowy road.”

“The car moves forward on the snowy road, with its tires gripping the slippery surface to maintain traction.”

“The tires do not grip the slippery surface, and the car maintains traction despite the lack of friction. The snowy road is solid and not slippery at all. The car moves forward effortlessly on the snowy road, with its tires floating above the ground.”

“The car's speed is consistent with its motion, adapting to the snowy conditions for safety and efficiency.”

“The car's speed is inconsistent with its motion, not adapting to the snowy conditions for safety and efficiency. The snowy conditions have no effect on the car's motion. The car moves at a constant speed regardless of the snowy conditions.”

“The wheels rotate at a slower pace due to the increased friction from the snow, demonstrating the impact of the road's texture on motion dynamics.”
“The wheels rotate at an unrealistically fast pace due to the increased friction from the snow, demonstrating the impact of the road's texture on motion dynamics. The road's texture moves backward as the car moves forward, defying the expected relationship between the car's motion and the road's texture.”

“The car maintains its shape as it drives, with no visible deformations or alterations due to the cold weather.”

“The car becomes misshapen or deformed due to the cold weather. The car's shape changes due to the cold weather. The car shrinks or expands due to the cold temperature.”

“The snow on the road appears freshly fallen, with no visible signs of melting or disturbance, indicating recent snowfall.”

“The snow remains perfectly undisturbed and does not melt or change shape despite exposure to sunlight and temperature fluctuations.”

“The sky is clear, providing consistent lighting conditions that enhance the visibility of the snowy landscape.”

“The colors of the sky shift rapidly, creating an otherworldly atmosphere. The visibility of the snowy landscape varies as the sky changes its appearance. The sky's lighting conditions are inconsistent, causing the snowy landscape to be illuminated by different light sources at different times.”

“The car casts a shadow on the snow, accurately reflecting the sun's position and time of day, adding to the realism of the scene.”

“The sun's position and time of day are inconsistent with the shadow cast by the car. The car casts an inaccurate shadow that does not reflect the sun's position and time of day.”

Figure 12. A set of physics-grounded and physics-counterfactual prompts generated during inference. The representative frames from the generated video chunks are shown on the right.
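The counterfactual prompts above serve as negative guidance at sampling time. As a minimal sketch of how such a negative physics prompt could enter a classifier-free-guidance-style update (toy NumPy arrays stand in for the denoiser's noise predictions; the actual model, conditioning interface, and update rule may differ):

```python
import numpy as np

def guided_noise_pred(eps_pos: np.ndarray,
                      eps_neg: np.ndarray,
                      guidance_scale: float = 6.0) -> np.ndarray:
    """Combine the prediction conditioned on the physics-grounded prompt
    (eps_pos) with the one conditioned on the counterfactual prompt
    (eps_neg), pushing the sample away from the physics-violating mode."""
    return eps_neg + guidance_scale * (eps_pos - eps_neg)

# Toy stand-ins for noise predictions on one latent chunk
rng = np.random.default_rng(1)
eps_pos = rng.standard_normal((3, 8, 8))  # conditioned on local physics prompt
eps_neg = rng.standard_normal((3, 8, 8))  # conditioned on counterfactual prompt
eps = guided_noise_pred(eps_pos, eps_neg, guidance_scale=6.0)
```

With guidance_scale = 1 this reduces to the positive-prompt prediction; larger scales (Table 5 reports a guidance scale of 6) push the sample further from trajectories favored by the counterfactual prompt.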
[Figure 13 shows side-by-side generations from PhysVid and Wan-14B for the prompts: “Honey pours into a cup of tea.”, “Raindrops disturb quiet puddles.”, “Skateboard rolls swiftly over the bumpy sidewalk.”, “Hand holds the phone.”, “Water gushes from a green garden hose.”, “Cheese is grating through the stainless steel grater.”, “Water flows freely from a fully turned faucet.”, “An electric beater whips cream in a bowl.”]

Figure 13. Additional comparisons between PhysVid and Wan-14B. Captions are from VideoPhy.

[Figure 14 compares, for each prompt, generations from PhysVid, PhysVid without counterfactual prompting, the baseline, and the finetuned baseline. Prompts: “A reed diffuser diffusing perfume oil into the room.”, “Wooden swing dangles over the sand in the sandpit.”, “A wine bottle pours a red blend into a glass.”, “Paint swirling in jar of water.”]

Figure 14. Supplementary qualitative results from the ablation study. All prompts are from VideoPhy.

Table 5. Configurations

Training
  Base architecture: Wan-1.3B
  Additional parameters (M): 400
  Effective batch size: 64
  Number of steps: 3000
  Number of epochs: 4
  Learning rate (Stage 1: 1000 steps, frozen base layers): 1 × 10^-5
  Learning rate (Stage 2: 2000 steps, full architecture): 2 × 10^-6
  Loss: Flow Matching
  Optimizer: AdamW
  Timestep Shift Factor: 8
  Number of Latent Frames per Chunk: 3
  Number of Latent Chunks: 7

Inference
  Number of Denoising Steps: 50
  Guidance Scale: 6
  Wan-1.3B Latency per Video (s): 66
  Wan-14B Latency per Video (s): 310
  PhysVid Latency per Video (s): 93
  Video Resolution: 832 × 480
  FPS: 16
  Duration (s): 5.06
  Frames: 81
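Tying together the Table 5 settings, the sketch below partitions the 21 latent frames (7 chunks × 3 latent frames) into contiguous chunks and pairs each chunk with one of the seven local physics prompts, in the spirit of the chunk-aware conditioning described earlier. The helper names and placeholder prompts are hypothetical; the real pipeline operates on latent tensors inside the model.

```python
# Settings from Table 5: 7 latent chunks, 3 latent frames per chunk.
NUM_CHUNKS = 7
FRAMES_PER_CHUNK = 3

def chunk_indices(num_chunks: int = NUM_CHUNKS,
                  frames_per_chunk: int = FRAMES_PER_CHUNK):
    """Contiguous latent-frame index ranges, one list per chunk."""
    return [list(range(i * frames_per_chunk, (i + 1) * frames_per_chunk))
            for i in range(num_chunks)]

def pair_prompts_with_chunks(local_prompts):
    """Pair each of the seven local physics prompts with its chunk's frames."""
    assert len(local_prompts) == NUM_CHUNKS
    return list(zip(local_prompts, chunk_indices()))

# Placeholder prompts standing in for the VLM-generated local captions
prompts = [f"local physics prompt {i}" for i in range(NUM_CHUNKS)]
pairs = pair_prompts_with_chunks(prompts)
```

Each (prompt, frame-index) pair then conditions its own chunk via the chunk-aware cross-attention, while the global T2V prompt conditions all chunks.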
