MV-S2V: Multi-View Subject-Consistent Video Generation
Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation renders the S2V task reducible to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. To address the scarcity of training data, we first develop a synthetic data curation pipeline to generate highly customized synthetic data, complemented by a small-scale real-world captured dataset to boost the training of MV-S2V. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS-RoPE) to distinguish between different subjects and distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs, establishing a meaningful new direction for subject-driven video generation. Our project page is available at: https://szy-young.github.io/mv-s2v
💡 Research Summary
The paper “MV‑S2V: Multi‑View Subject‑Consistent Video Generation” introduces a new research problem that goes beyond the limitations of existing Subject‑to‑Video (S2V) methods. Traditional S2V approaches rely on a single reference image per subject, which effectively reduces the task to a sequential pipeline of Subject‑to‑Image (S2I) followed by Image‑to‑Video (I2V). Consequently, they cannot fully exploit the potential of video‑level subject control, especially when a subject needs to be presented from multiple viewpoints.
Problem Definition
MV‑S2V is defined as the task of generating a video that maintains 3‑D‑level consistency with a set of multi‑view reference images for each subject. Given a textual prompt $y$ and a collection of reference images $\mathcal{R}=\{R_1,\dots,R_N\}$, where each $R_i$ contains $M_i$ views of subject $i$, the goal is to model the conditional distribution $p(V\mid \mathcal{R}, y)$ and synthesize a video $V$ whose appearance matches the provided views from every angle. This formulation captures the core value of S2V (direct, fine‑grained control over dynamic subject appearance) while eliminating the “S2I + I2V” shortcut.
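To make the conditioning structure concrete, the inputs can be sketched as a small data structure: a prompt $y$ plus, for each of the $N$ subjects, its $M_i$ reference views. All names here are illustrative placeholders, not the authors' API:

```python
# Hypothetical sketch of the MV-S2V conditioning inputs (illustrative names,
# not the paper's code): R = {R_1, ..., R_N}, where R_i holds the M_i
# reference views of subject i, plus a text prompt y.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubjectRefs:
    subject_id: int
    views: List[str]  # paths/ids of the M_i reference views of this subject

@dataclass
class MVS2VCondition:
    prompt: str                                                # text condition y
    subjects: List[SubjectRefs] = field(default_factory=list)  # R_1 .. R_N

cond = MVS2VCondition(prompt="a ceramic mug rotating on a wooden table")
cond.subjects.append(SubjectRefs(0, ["mug_front.png", "mug_side.png", "mug_back.png"]))
num_views = sum(len(s.views) for s in cond.subjects)  # total reference views across subjects
```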
Data Scarcity and Curation
A major obstacle is the lack of large‑scale training data that contain both multi‑view videos and corresponding reference images. The authors address this with a two‑pronged data strategy:
- Synthetic Data Pipeline – High‑quality object and human assets are first generated with a subject‑to‑image model (Nano‑Banana), then animated with state‑of‑the‑art Image‑to‑Video (I2V) models. For object‑centric (OC) videos, a controllable I2V model (Uni3C) explicitly dictates camera trajectories, ensuring that the object is shown from many angles. For human‑object interaction (HOI) videos, WAN 2.2 is employed for its strong prompt‑following capability, allowing a person to rotate or manipulate the object smoothly. After video synthesis, a captioning model (T5‑based Taiser2) produces textual descriptions, and Grounded‑SAM segments the main subject to extract multi‑view reference frames. Low‑quality samples are filtered automatically using Gemini 2.5. The pipeline yields 11,804 OC and 10,130 HOI synthetic samples.
- Real‑World Capture – To mitigate the “copy‑paste” artifacts inherent in synthetic data and to improve photorealism, a small real‑world dataset is collected. Videos and reference images are captured separately for both OC and HOI scenarios (1,724 and 1,514 samples respectively), guaranteeing that the reference views are not directly taken from the training videos.
Model Architecture
The generation backbone is the pre‑trained text‑to‑video diffusion model WAN 2.1. Input processing consists of:
- A 3‑D Variational Auto‑Encoder (VAE) that encodes the target video into a latent tensor $F_v \in \mathbb{R}^{T\times C\times H\times W}$. The same encoder processes each reference image into $F_{r,i,m} \in \mathbb{R}^{C\times H\times W}$, aligning visual features in latent space.
- A DiT (Diffusion Transformer) that iteratively denoises the latent video conditioned on text embeddings (from a T5 encoder) and visual embeddings via cross‑attention.
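A little shape bookkeeping clarifies how the pieces fit together. The concrete dimensions below are placeholders, and treating each reference latent as one extra "frame" appended to the video latent is an assumption about how the conditioning is sequenced:

```python
# Shape bookkeeping sketch for the latent inputs. Dimensions are illustrative
# placeholders; appending references along the frame axis is an assumption.
T, C, H, W = 16, 16, 30, 52        # video latent: frames x channels x height x width
video_latent_shape = (T, C, H, W)  # F_v
ref_latent_shape = (C, H, W)       # F_{r,i,m}: each reference encodes to one latent frame
views_per_subject = [2, 3]         # M_1 = 2, M_2 = 3

# If reference latents join the token sequence along the frame axis, N subjects
# with M_i views each contribute sum(M_i) extra frames.
total_frames = T + sum(views_per_subject)
```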
Temporally Shifted RoPE (TS‑RoPE)
A central technical contribution is a conditioning mechanism that can distinguish (a) different subjects and (b) different views of the same subject. Prior works concatenate reference images along the frame dimension or overlay them on a single canvas, which conflates these two dimensions. The authors build on Rotary Position Encoding (RoPE) and introduce a temporal shift: each subject $i$ receives a unique offset $\delta_i$ applied to the temporal RoPE indices of its reference tokens. Consequently, embeddings of the same subject but different views are separated from the video frames by the temporal shift, while embeddings of different subjects are kept apart by their distinct offsets. This “TS‑RoPE” enables the model to treat multi‑view references as a structured set rather than a flat bag, preventing cross‑subject confusion during generation.
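One plausible instantiation of this index assignment can be sketched as follows; the exact offset scheme (the `gap` between subjects, and consecutive positions for views within a subject) is an assumption for illustration, not the paper's formula:

```python
# Illustrative TS-RoPE index assignment (assumed scheme, not the exact paper
# formula). Video frames occupy temporal positions 0..T-1; subject i's views
# are shifted by a unique per-subject offset delta_i, so different subjects
# never share a temporal position.

def rope_angles(position, dim=8, base=10000.0):
    # Standard RoPE rotation angles for a single (temporal) position.
    return [position / base ** (2 * k / dim) for k in range(dim // 2)]

def temporal_positions(num_frames, views_per_subject, gap=8):
    video = list(range(num_frames))                       # positions 0..T-1
    refs = []
    for i, m in enumerate(views_per_subject):
        delta_i = num_frames + i * gap                    # unique shift for subject i
        refs.append([delta_i + j for j in range(m)])      # views stay in subject i's region
    return video, refs

video_pos, ref_pos = temporal_positions(num_frames=8, views_per_subject=[2, 3])
# ref_pos -> [[8, 9], [16, 17, 18]]: subjects occupy disjoint temporal regions.
```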
Training and Losses
Training follows the standard diffusion objective on the latent space, with additional regularization to preserve identity across views. The synthetic dataset provides abundant multi‑view supervision, while the real‑world data improves robustness to natural lighting, background variation, and sensor noise.
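For reference, the "standard diffusion objective" can be written in its epsilon-prediction form as a minimal sketch; the backbone's actual parameterization (e.g. flow matching) may differ, and the toy "model" below simply predicts zeros:

```python
# Minimal sketch of the standard latent-diffusion training objective
# (epsilon-prediction MSE). This is a generic form, not the paper's exact loss.
import math
import random

def diffusion_loss(x0, model, alpha_bar, rng):
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                        # sampled noise
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]                                # noised latent
    pred = model(xt)                                               # predicted noise
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(x0)  # MSE over eps

rng = random.Random(0)
loss = diffusion_loss([0.5, -1.0, 0.25], lambda xt: [0.0] * len(xt), 0.9, rng)
```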
Evaluation Metrics
To quantify 3‑D subject consistency, the authors introduce three complementary metrics:
- 3‑D Reconstruction Error – Using a neural radiance field (NeRF) built from generated frames, they compute the distance between reconstructed geometry and the ground‑truth geometry implied by the reference views.
- View‑Consistency Similarity – Cosine similarity between feature vectors (extracted by CLIP) of generated frames and their corresponding reference views.
- Human Preference Study – Pairwise comparisons between MV‑S2V outputs and baseline single‑view S2V results, focusing on subject fidelity, smoothness of motion, and overall visual quality.
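The view-consistency metric above reduces to a cosine similarity between image features. A minimal sketch, with toy vectors standing in for CLIP image embeddings:

```python
# Sketch of the view-consistency similarity metric: cosine similarity between
# feature vectors of a generated frame and a reference view. The toy vectors
# below stand in for CLIP image embeddings.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

same = cosine_similarity([0.2, 0.5, 0.8], [0.2, 0.5, 0.8])  # identical features -> 1.0
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # orthogonal features -> 0.0
```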
Results
Across both OC and HOI scenarios, MV‑S2V outperforms strong baselines (including recent single‑view S2V methods) by a substantial margin:
- 3‑D reconstruction error reduced by ~12 % on average.
- View‑consistency similarity increased by ~15 %.
- In human studies, MV‑S2V received an average preference score of 4.3/5 versus 3.6/5 for baselines.
Qualitatively, generated videos display smooth camera or object rotations that faithfully reflect the geometry of the multi‑view references, while preserving the textual description (e.g., color, material).
Contributions Summary
- Problem Formulation – Formal definition of Multi‑View Subject‑Consistent Video Generation, highlighting its distinction from the S2I + I2V pipeline.
- Data Generation – A novel synthetic‑plus‑real data curation pipeline that supplies large‑scale, high‑quality multi‑view video‑reference triples.
- Methodology – The TS‑RoPE conditioning mechanism that cleanly separates cross‑subject and cross‑view information within the diffusion framework.
- Evaluation – New metrics for 3‑D subject consistency and extensive experiments demonstrating superior performance.
Impact and Future Directions
By enabling true 3‑D‑aware subject control, MV‑S2V opens avenues for applications such as product advertising (where a product must be shown from all angles), augmented reality avatars, and virtual cinematography. Future work may explore extending the approach to dynamic backgrounds, integrating explicit pose estimation, or scaling to higher‑resolution video generation.
In summary, the paper presents a comprehensive solution to the previously overlooked challenge of multi‑view subject consistency in video generation, combining innovative data synthesis, a principled conditioning design, and thorough evaluation to set a new benchmark for subject‑driven video synthesis.