Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs
We address image-to-video generation with explicit user control over the final frame’s disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior synthesizes plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow that lets us use diffusion as a motion-guided shader. The user can then edit appearance in the disoccluded areas of the image, and we exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages over state-of-the-art alternatives in turning images into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable control, in the form of appearance specification in the disoccluded regions of the final frame, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyond-visible.github.io/
💡 Research Summary
The paper introduces a training‑free image‑to‑video generation pipeline that gives users explicit control over both articulated motion and the appearance of newly revealed (disoccluded) regions in the final frame. The core contribution is the Proxy Dynamic Graph (PDG), a lightweight, user‑editable directed acyclic graph that abstracts the 3D geometry and motion of objects and their parts. Nodes correspond to rigid or semi‑rigid parts represented as point clouds; edges encode parent‑child relationships together with motion parameters (center, axis, translation/rotation range). Users construct the PDG by first applying off‑the‑shelf depth‑and‑camera estimation (MoGe) and segmentation (SAM2) to the input image, then drawing simple 2‑D bounding boxes around parts. The system lifts the depth map into per‑part point clouds, lets the user define the hierarchy and motion constraints, and finally re‑poses the graph to a target pose.
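The PDG described above (nodes as part point clouds, edges carrying joint parameters and parent-child structure) can be sketched as a small data structure. The following is a minimal illustration under assumptions, not the authors' implementation: all class and function names (`PDGNode`, `MotionParams`, `repose`, `joint_transform`) are hypothetical, and the joint model is a simple revolute/prismatic transform applied recursively down the hierarchy.

```python
from dataclasses import dataclass, field
import numpy as np

# Hypothetical sketch of a Proxy Dynamic Graph (PDG). Names and the
# exact joint parameterization are illustrative assumptions.

@dataclass
class MotionParams:
    """Joint parameters stored on a parent->child edge."""
    kind: str            # "rotation" or "translation"
    center: np.ndarray   # 3D joint center (ignored for translations)
    axis: np.ndarray     # 3D joint axis
    limits: tuple        # (min, max) angle in radians, or offset range

@dataclass
class PDGNode:
    name: str
    points: np.ndarray   # (N, 3) part point cloud lifted from the depth map
    children: list = field(default_factory=list)  # list of (PDGNode, MotionParams)

def joint_transform(joint: MotionParams, v: float) -> np.ndarray:
    """4x4 rigid transform for one joint at value v."""
    if joint.kind == "translation":
        T = np.eye(4)
        T[:3, 3] = v * joint.axis
        return T
    # Rodrigues rotation about joint.axis through joint.center.
    k = joint.axis / np.linalg.norm(joint.axis)
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    R = np.eye(3) + np.sin(v) * K + (1 - np.cos(v)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = joint.center - R @ joint.center
    return T

def apply(T: np.ndarray, pts: np.ndarray) -> np.ndarray:
    return pts @ T[:3, :3].T + T[:3, 3]

def repose(node: PDGNode, values: dict, transform=np.eye(4)) -> dict:
    """Apply joint values recursively, accumulating rigid transforms down
    the hierarchy; returns the reposed point cloud of every part."""
    out = {node.name: apply(transform, node.points)}
    for child, joint in node.children:
        v = np.clip(values.get(child.name, 0.0), *joint.limits)
        out.update(repose(child, values, transform @ joint_transform(joint, v)))
    return out
```

Clipping each joint value to its stored range mirrors the motion constraints the user attaches to PDG edges; reposing the graph then amounts to one recursive traversal from the root.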
From the re‑posed PDG the system derives dense optical flow and a time‑varying binary disocclusion mask that marks pixels becoming visible as parts move. This flow is fed, together with the original image and a text prompt, into a pre‑trained image‑to‑video diffusion model (DaS – Diffusion‑as‑Shader). DaS treats the flow as a “tracking video” and denoises a random latent conditioned on the image, the flow, and the prompt, producing a short video where motion follows the PDG while the diffusion prior fills only the disoccluded areas with plausible texture, lighting, and secondary effects.
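The flow-and-mask derivation above can be illustrated by projecting each part's point cloud before and after reposing and splatting the per-point displacement into a dense field. This is a simplified sketch under assumptions (nearest-point z-buffering, round-to-nearest splatting, single frame pair); the function names and the exact disocclusion test are hypothetical stand-ins for the paper's procedure.

```python
import numpy as np

# Hypothetical sketch: derive a dense 2D flow field and a binary
# disocclusion mask from point clouds before/after reposing.

def project(points: np.ndarray, K: np.ndarray):
    """Pinhole projection of (N, 3) camera-space points to pixels + depth."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3], points[:, 2]

def flow_and_disocclusion(points_src, points_dst, K, H, W):
    uv_src, z_src = project(points_src, K)
    uv_dst, _ = project(points_dst, K)
    flow = np.zeros((H, W, 2))
    covered_src = np.zeros((H, W), dtype=bool)
    covered_dst = np.zeros((H, W), dtype=bool)
    depth = np.full((H, W), np.inf)
    for (u0, v0), (u1, v1), z in zip(uv_src, uv_dst, z_src):
        x, y = int(round(u0)), int(round(v0))
        if 0 <= x < W and 0 <= y < H and z < depth[y, x]:  # nearest point wins
            depth[y, x] = z
            flow[y, x] = (u1 - u0, v1 - v0)
            covered_src[y, x] = True
        x1, y1 = int(round(u1)), int(round(v1))
        if 0 <= x1 < W and 0 <= y1 < H:
            covered_dst[y1, x1] = True
    # Pixels the moving parts vacate become newly visible (disoccluded).
    disocclusion = covered_src & ~covered_dst
    return flow, disocclusion
```

Evaluating this per intermediate pose yields the time-varying mask described above; the dense flow is what the diffusion model consumes as its "tracking video" conditioning.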
Crucially, the pipeline allows the user to edit the final frame’s disoccluded content using any external image editor. The edited frame is encoded back into the latent space, and the corresponding latent features of the last frame are swapped into the video’s latent sequence. A second forward pass through DaS (without any weight updates) re‑generates the entire video, now consistent with the user‑specified appearance in the newly revealed regions. Because the replacement happens in latent space, the method avoids pixel‑level seams, preserves identity in unchanged regions, and automatically propagates shadows and reflections that match the edited content.
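The latent-space swap can be sketched as a masked composite on the final frame's latent. Everything here is an illustrative assumption: `encode` is a stand-in for the VAE encoder (modeled as a plain 8x downsample), and `latent_composite` is a hypothetical name, not the DaS API; in the real pipeline the composited sequence would be passed through a second forward pass of the frozen model.

```python
import numpy as np

# Hypothetical sketch of the last-frame latent composite; `encode` is a
# toy stand-in (8x subsampling) for a real VAE encoder.

def encode(frame: np.ndarray) -> np.ndarray:
    """Toy encoder: subsample the frame by 8 in each spatial dimension."""
    return frame[::8, ::8]

def latent_composite(video_latents, edited_last_frame, mask):
    """Swap the edited frame's latent features into the final frame of the
    latent sequence, restricted to the (downsampled) disocclusion mask."""
    z_edit = encode(edited_last_frame)
    z = video_latents.copy()           # leave the original sequence intact
    z[-1][mask] = z_edit[mask]         # replace only the revealed region
    return z                           # then re-denoise with the frozen model
```

Because the composite happens on latent features rather than pixels, the subsequent forward pass can blend the edit smoothly and propagate consistent shadows and reflections, as described above.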
The approach is entirely training‑free: it relies only on a frozen diffusion model and standard vision tools (depth estimation, segmentation). Consequently, it can be applied immediately to diverse categories such as articulated objects, furniture, vehicles, and deformable items, limited only by the diffusion model’s latent manifold. Quantitative evaluations and user studies show that the method outperforms recent text‑guided, point/box‑guided, pure flow‑warping, and fine‑tuned editing models in pose fidelity, run‑to‑run stability, and identity preservation.
In summary, the paper presents a novel combination of a proxy articulation graph and latent‑space video diffusion, enabling deterministic, part‑level motion control together with predictable, user‑directed editing of disoccluded regions. This bridges the gap between generative flexibility and precise user intent, opening a new workflow for turning a single image into a controllable short video.