DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: https://github.com/YuZhenyuLindy/DeCorStory
💡 Research Summary
DeCorStory tackles a fundamental problem in text‑to‑image storytelling: maintaining visual and semantic consistency across multiple generated frames without any model fine‑tuning. Existing training‑free approaches, most notably One‑Prompt‑One‑Story, concatenate all frame‑level prompts into a single long text sequence. While this leverages the contextual power of large vision‑language models, it also creates highly correlated token embeddings for the same subject described under different contexts. The resulting inter‑frame embedding correlation leads to semantic leakage, color bleeding, background blending, and identity drift during diffusion.
The proposed framework introduces three complementary modules that operate entirely at inference time. First, after the standard prompt concatenation (identity prompt P₀ plus N frame‑specific prompts P₁…Pₙ), the method extracts the matrix X of frame‑level token embeddings. A row‑wise Gram‑Schmidt orthogonalization is then applied to X, producing a decorrelated matrix X̃. This operation rotates each frame embedding into an orthogonal direction while leaving the identity embedding and special tokens unchanged, thereby preserving overall meaning but dramatically reducing overlap between frames in the shared embedding space.
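The row‑wise orthogonalization step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the norm‑rescaling step (to keep each frame embedding at its original magnitude), and the choice to process rows in order are all assumptions for clarity.

```python
import numpy as np

def gram_schmidt_decorrelate(X):
    """Row-wise Gram-Schmidt orthogonalization of frame embeddings.

    X: (N, d) array, one row per frame-level embedding.
    Returns a matrix whose rows are mutually orthogonal; each row is
    rescaled to its original norm (an illustrative choice) so that the
    embedding magnitude, and hence its rough influence, is preserved.
    """
    X_tilde = np.zeros_like(X, dtype=float)
    for i, x in enumerate(X):
        v = x.astype(float).copy()
        for j in range(i):
            u = X_tilde[j]
            # Subtract the component of v along each earlier (already
            # orthogonalized) frame direction.
            v -= (v @ u) / (u @ u) * u
        # Restore the original embedding norm (hypothetical design choice).
        v *= np.linalg.norm(x) / (np.linalg.norm(v) + 1e-8)
        X_tilde[i] = v
    return X_tilde
```

In practice the decorrelated matrix would replace the frame‑level rows of the text‑encoder output before it is passed to the diffusion model, while the identity‑prompt and special‑token rows are left untouched, as the text describes.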
Second, the decorrelated embeddings are processed by Singular‑Value Reweighting (SVR). For the target frame j, the singular values of the concatenated matrix
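The description of SVR breaks off above, so the following is only a heavily hedged sketch of the general technique. It assumes the concatenated matrix stacks the identity embedding with the target frame's decorrelated embedding, and that "reweighting" means amplifying the leading singular value by a factor `alpha`; the function name, the stacking order, and the value of `alpha` are all hypothetical.

```python
import numpy as np

def svr_reweight(E_id, e_j, alpha=1.2):
    """Illustrative singular-value reweighting (SVR) sketch.

    E_id: (k, d) identity-prompt embeddings.
    e_j:  (1, d) decorrelated embedding of target frame j.
    alpha: amplification factor for the leading singular value
           (hypothetical parameter, not from the paper).
    """
    M = np.vstack([E_id, e_j])                     # concatenated matrix
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_rw = s.copy()
    s_rw[0] *= alpha                               # strengthen the dominant direction
    return U @ np.diag(s_rw) @ Vt                  # reweighted reconstruction
```

With `alpha = 1.0` the reconstruction returns the input unchanged, which makes the operation easy to sanity‑check; larger values emphasize the strongest shared direction of the identity/frame pair.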