ChatUMM: Robust Context Tracking for Conversational Interleaved Generation


Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via "distractor" turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.


💡 Research Summary

ChatUMM addresses a critical gap in current open‑source unified multimodal models (UMMs), which are largely confined to single‑turn interactions and therefore act as isolated solvers rather than conversational assistants. The authors propose a two‑pronged solution: an interleaved multi‑turn training strategy and a systematic conversational data synthesis pipeline.

The interleaved training treats a dialogue as a single serialized stream of text and image tokens, demarcated by special delimiters (`|im_s|`, `|im_e|` for text; `|v_s|`, `|v_e|` for images). These delimiters serve not only as structural markers but also as explicit modality-switch signals: predicting `|v_s|` after `|im_e|` triggers image generation, while predicting `|im_s|` after `|v_e|` initiates text generation. The model employs a decoder-only transformer with a Mixture-of-Transformers (MoT) architecture, sharing self-attention across layers while routing tokens to modality-specific feed-forward networks (a "Generation Expert" for VAE latents and an "Understanding Expert" for text and ViT tokens). Generalized Causal Attention ensures that each turn can attend to the full history, but only to clean (noise-free) visual representations, preserving long-range dependencies without contaminating the current generation with previous noisy latents.
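The delimiter-driven switching can be illustrated with a minimal sketch. This is not the authors' code: the token spellings and function names below are illustrative stand-ins, and the sketch only shows how a decoded stream would be routed between a text branch and an image branch based on which delimiter was last predicted.

```python
TEXT, IMAGE = "text", "image"

# Delimiter -> modality of the tokens that follow it, mirroring the
# described scheme: "v_s" opens an image span, "im_s" opens a text span.
SWITCH = {"v_s": IMAGE, "im_s": TEXT}

def route_stream(tokens):
    """Assign each token in a serialized dialogue to the branch
    (text expert or image expert) that should process it."""
    mode = TEXT
    routed = []
    for tok in tokens:
        if tok in SWITCH:
            mode = SWITCH[tok]
        routed.append((tok, mode))
    return routed

stream = ["im_s", "Here", "is", "a", "cat", "im_e",
          "v_s", "<latent_0>", "<latent_1>", "v_e",
          "im_s", "Done", "im_e"]
for tok, mode in route_stream(stream):
    print(f"{tok:>12} -> {mode}")
```

At inference, the analogous decision is made autoregressively: once the model emits `|v_s|`, subsequent positions are decoded as image latents until the closing delimiter hands control back to the text branch.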

To overcome the scarcity of high‑quality multi‑turn data, the authors design a three‑stage data synthesis pipeline. First, single‑turn samples are converted into basic stateful dialogues, establishing a minimal context flow. Second, “distractor” turns—unrelated utterances generated by a large language model—are inserted, and the original query is rewritten to be history‑dependent, forcing the model to retrieve relevant information from a noisy history. Third, the pipeline synthesizes interleaved multimodal responses, allowing the final turn to output both text and images in a natural, fluid manner. The pipeline is guided by a four‑dimensional taxonomy that categorizes data by user input modality (text‑only or text‑plus‑image), model output modality (image‑only or text‑plus‑image), dependency type, and depth.
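The three stages above can be sketched in a few lines. Everything here is hypothetical scaffolding (function names, message fields, the `rewrite` callback standing in for the LLM-based query rewriter), intended only to make the data flow concrete.

```python
# Illustrative sketch of the three synthesis stages; all names are
# hypothetical, not the authors' actual pipeline code.

def to_stateful_dialogue(query, answer):
    """Stage 1: wrap a single-turn sample as a minimal stateful dialogue."""
    return [{"role": "user", "content": query},
            {"role": "assistant", "content": answer}]

def insert_distractors(dialogue, distractors, rewrite):
    """Stage 2: prepend unrelated turns, then rewrite the original query
    so that resolving it requires looking past the distractors."""
    noisy = [turn for q, a in distractors
             for turn in ({"role": "user", "content": q},
                          {"role": "assistant", "content": a})]
    user, assistant = dialogue
    rewritten = {"role": "user", "content": rewrite(user["content"])}
    return noisy + [rewritten, assistant]

# Stage 3 (interleaved output) is represented here by an answer that
# mixes text with an image placeholder.
dlg = to_stateful_dialogue("Turn the cat photo black and white.",
                           "Here you go: <image>")
aug = insert_distractors(
    dlg,
    distractors=[("What's the capital of France?", "Paris.")],
    rewrite=lambda q: "Back to the earlier photo: make it black and white.")
for turn in aug:
    print(turn["role"], "->", turn["content"])
```

The rewritten query ("the earlier photo") is unresolvable without the history, which is exactly the retrieval pressure the distractor stage is meant to create.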

Training optimizes two losses simultaneously: a standard cross‑entropy loss (L_CE) for text generation and special tokens, and a flow‑matching MSE loss (L_MSE) for image generation, which predicts the velocity field that transports noise to data, yielding high‑fidelity images.
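A toy, framework-free sketch of the two objectives follows. The λ-weighted sum and the linear interpolation path x_t = (1 - t)·noise + t·data (whose target velocity is data - noise) are standard flow-matching conventions assumed here; the paper's exact weighting and noise schedule are not specified in this summary.

```python
import math

def cross_entropy(logits, target):
    """L_CE for one token: -log softmax(logits)[target],
    computed with the usual max-shift for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def flow_matching_mse(pred_velocity, noise, data):
    """L_MSE: regress the predicted velocity onto the flow-matching
    target (data - noise) for a linear interpolation path."""
    target = [d - n for d, n in zip(data, noise)]
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, target)) / len(target)

def total_loss(logits, target_tok, pred_v, noise, data, lam=1.0):
    """Combined objective; lam is an assumed balancing weight."""
    return cross_entropy(logits, target_tok) + lam * flow_matching_mse(pred_v, noise, data)
```

A model that predicts the velocity field exactly drives L_MSE to zero, meaning it can transport any noise sample to the data distribution along the learned flow.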

Extensive evaluations show that ChatUMM outperforms existing open‑source UMMs such as LLaVA, MiniGPT‑4, and Visual‑ChatGPT on visual question answering, instruction‑guided image editing, and text‑to‑image generation benchmarks. Notably, in complex multi‑turn scenarios (five or more turns), ChatUMM demonstrates markedly lower error rates and superior robustness to distractor turns, confirming the effectiveness of both the interleaved token design and the distractor‑insertion data augmentation. Ablation studies isolate the contributions of each component, revealing that the interleaved delimiters, the generalized causal attention, and the three‑stage data synthesis each provide measurable performance gains.

In summary, ChatUMM introduces a cohesive framework that unifies model architecture, training methodology, and data generation to enable persistent context tracking, long‑range dependency resolution, and accurate intent disambiguation in multimodal dialogues. This work paves the way for next‑generation AI assistants capable of seamless, iterative creative and problem‑solving interactions across text and images.

