Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision

Notice: This research summary and analysis were generated automatically with AI assistance. For full accuracy, please refer to the original arXiv source.

Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.


💡 Research Summary

Unified Multimodal Models (UMMs) aim to handle both multimodal understanding and generation within a single sequence‑modeling framework, treating visual tokens similarly to text tokens. Existing training paradigms, however, suffer from two fundamental issues. First, the granularity mismatch: a textual prompt provides only coarse semantic constraints, while visual tokens encode dense spatial details. Consequently, a single prompt can correspond to many plausible images, yet the model is forced to reconstruct a single ground‑truth image, penalizing semantically valid variations. Second, supervisory redundancy: image‑conditioned training (e.g., Reca) uses all visual tokens as “visual hints,” many of which belong to low‑salience background regions, diluting attention. Moreover, random masking ignores semantic importance, allocating reconstruction loss to irrelevant patches.

The paper introduces Semantically‑Grounded Supervision (SeGroS), a fine‑tuning framework that restructures supervision based on explicit text‑image alignment. SeGroS proceeds in three steps. (1) Discriminative Text Token Filtering identifies linguistically salient tokens that also have strong visual counterparts. This is achieved by computing intra‑modal (text‑text) affinity via a self‑attention matrix and inter‑modal (text‑image) affinity via cosine similarity between normalized text and visual embeddings. Tokens scoring high on both criteria are retained. (2) A Visual Grounding Map is built by measuring similarity between the filtered text tokens and each image patch, yielding a grounding score for every visual token. High‑scoring patches are extracted as “Visual Hints,” providing dense, text‑aligned conditioning signals. (3) Using the same grounding map, a Semantically‑Grounded Corrupted Input is constructed: low‑grounded tokens remain unmasked (serving as context), while high‑grounded tokens are masked and must be reconstructed. This forces the reconstruction loss to focus on core semantic regions rather than random patches.
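Steps (1) and (2) can be illustrated with a minimal NumPy sketch on toy random embeddings. The shapes, the softmax attention, and the "keep the top half by combined score" selection rule are illustrative assumptions, not the paper's actual settings; a real implementation would use the UMM's own text and patch embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, d = 8, 16, 32                      # text tokens, visual patches, embedding dim
text = rng.normal(size=(T, d))           # stand-in text token embeddings
patches = rng.normal(size=(V, d))        # stand-in visual patch embeddings

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# (1a) intra-modal affinity: softmax self-attention over text tokens;
# a token's mean received attention proxies its linguistic salience
attn = text @ text.T / np.sqrt(d)
attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
intra = attn.mean(axis=0)                # (T,)

# (1b) inter-modal affinity: max cosine similarity to any image patch
cos = l2norm(text) @ l2norm(patches).T   # (T, V)
inter = cos.max(axis=1)                  # (T,)

# retain tokens that score high on BOTH criteria (here: top half of a
# z-scored combined score -- the actual thresholding rule is a guess)
score = (intra - intra.mean()) / intra.std() + (inter - inter.mean()) / inter.std()
keep = np.argsort(score)[-(T // 2):]

# (2) visual grounding map: each patch's best similarity to a retained token
grounding = cos[keep].max(axis=0)        # (V,) -- one score per visual token
```

High-`grounding` patches would then be extracted as Visual Hints in step (2) and drive the masking in step (3).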

SeGroS does not modify the underlying UMM architecture; it only changes the fine‑tuning data preparation, so inference remains unchanged. Experiments on three benchmarks, GenEval (text‑to‑image generation quality), DPGBench (detail and composition accuracy), and CompBench (complex multimodal reasoning), show consistent improvements across different UMM backbones (e.g., Harmon, Show‑o). Notably, pruning visual hints to the top 30% of grounded patches already yields gains, confirming that redundant background tokens hinder learning. The method improves generation fidelity by 0.5–1.0 points on the evaluated metrics while reducing unnecessary supervision.
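Given a per-patch grounding score, both supervision signals reduce to simple index selection. The sketch below uses a 0.3 hint ratio to mirror the top-30% ablation; the mask ratio, the `MASK` sentinel, and the stand-in scores are placeholders, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 16
grounding = rng.uniform(size=V)              # stand-in per-patch grounding scores
tokens = rng.integers(0, 1024, size=V)       # stand-in discrete visual token ids
MASK = -1                                    # placeholder mask token id

# Visual Hints: keep only the top-30% most grounded patches as conditioning
hint_ratio = 0.3
k = max(1, int(round(hint_ratio * V)))
hints = np.argsort(grounding)[-k:]           # indices of text-aligned patches

# Semantically-grounded corrupted input: mask the HIGH-grounded patches
# (forcing reconstruction of core semantic regions) and leave low-grounded
# background patches visible as context. Mask ratio here is assumed.
mask_ratio = 0.5
m = int(round(mask_ratio * V))
masked_idx = np.argsort(grounding)[-m:]
corrupted = tokens.copy()
corrupted[masked_idx] = MASK

# reconstruction loss would be computed only at the masked positions
loss_positions = masked_idx
```

This is the opposite of uniform random masking: the loss budget is spent where the grounding map says the text actually constrains the image.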

In summary, SeGroS contributes (1) a fine‑grained grounding mechanism that filters discriminative text tokens and aligns them with image regions, (2) two complementary supervision signals—semantic Visual Hints and a grounded corrupted input—that concentrate learning on text‑aligned visual content, and (3) a lightweight, architecture‑agnostic fine‑tuning recipe that markedly enhances cross‑modal alignment and generation quality in unified multimodal models. Future work may extend this grounding‑driven supervision to other modalities such as video, audio, or 3‑D data.

