MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across modalities. Despite its empirical success, this paradigm is primarily built on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This leads to computational inefficiency when scaling, since it requires a separate forward pass for each pair, and it overlooks contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This allows us to extract multiple query and target embeddings simultaneously, conditioned on a shared context representation, amplifying the effective batch size and overall training efficiency. Experiments show that MuCo, trained with a newly curated 5M-sample multimodal multi-turn dataset (M3T), achieves state-of-the-art retrieval performance on the MMEB and M-BEIR benchmarks while markedly improving both training efficiency and representation coherence across modalities. Code and M3T are available at https://github.com/naver-ai/muco


💡 Research Summary

MuCo (Multi-turn Contrastive Learning) tackles two fundamental inefficiencies in current multimodal embedding models built on Multimodal Large Language Models (MLLMs). Traditional approaches treat each image-text pair as an isolated "single-turn" sample, which (1) ignores the rich contextual relationships among multiple queries that can be derived from the same image, and (2) forces the visual encoder to run repeatedly for every pair, making large-batch contrastive training prohibitively expensive.

The core idea of MuCo is to restructure the training data as a dialogue: for a given image, several query‑answer pairs are arranged sequentially as turns. A special token <|emb|> is placed at the end of each assistant response, allowing the model to emit an embedding for that turn directly. Because the image is encoded only once (in the first turn), all subsequent turns are text‑only and thus computationally cheap. This design yields an effective batch size that is multiplied by the number of turns (k) without a proportional increase in FLOPs.
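As a rough illustration of this data layout, the sketch below packs several query-target pairs for one image into a single dialogue. It is a hypothetical reconstruction, not the paper's code: the function name is invented, and the embedding token is appended to both query and target turns (the summary mentions it only for assistant responses, but the abstract states that both query and target embeddings are extracted), so the exact placement may differ.

```python
# Sketch: packing k query-target pairs for one image into one
# multi-turn sample. "<|emb|>" marks where an embedding is read out;
# the (expensive) image token appears only once, in the first turn.
EMB_TOKEN = "<|emb|>"
IMAGE_TOKEN = "<image>"  # placeholder; the real template is model-specific

def build_multiturn_sample(queries, targets):
    """Interleave queries and targets as dialogue turns.

    Returns a list of (role, text) turns; each turn ends with
    EMB_TOKEN so the model emits one embedding per turn.
    """
    assert len(queries) == len(targets)
    turns = []
    for j, (q, t) in enumerate(zip(queries, targets)):
        # Only the first user turn carries the image.
        prefix = IMAGE_TOKEN + " " if j == 0 else ""
        turns.append(("user", prefix + q + " " + EMB_TOKEN))
        turns.append(("assistant", t + " " + EMB_TOKEN))
    return turns

sample = build_multiturn_sample(
    ["What animal is shown?", "What color is it?"],
    ["A dog.", "Brown."],
)
```

Because the image is encoded once and every later turn is text-only, one forward pass yields k query-target embedding pairs at near-constant vision cost.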

Formally, for an image I, a sequence of queries {q₁, …, q_k}, and corresponding positive targets {p₁, …, p_k}, MuCo constructs cumulative inputs \(\bar{q}_j = (I, q_1, \dots, q_j)\) and \(\bar{p}_j = (p_1, \dots, p_j)\). The contrastive loss is extended across all turns and batch elements:

\[
\mathcal{L} = -\frac{1}{Bk} \sum_{i=1}^{B} \sum_{j=1}^{k} \log \frac{\exp\!\big(\mathrm{sim}(\bar{q}_j^{(i)}, \bar{p}_j^{(i)})/\tau\big)}{\sum_{i'=1}^{B} \sum_{j'=1}^{k} \exp\!\big(\mathrm{sim}(\bar{q}_j^{(i)}, \bar{p}_{j'}^{(i')})/\tau\big)}
\]

where B is the batch size, τ is a temperature, and sim(·, ·) denotes cosine similarity, so all B·k targets serve as in-batch negatives for every query.
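To make the "effective batch size of B·k" concrete, here is a minimal NumPy sketch of a multi-turn InfoNCE loss over pre-computed embeddings. It assumes L2-normalized embeddings and treats every target in the flattened batch as a negative for every query; the function name and temperature default are illustrative, not taken from the paper.

```python
import numpy as np

def multiturn_infonce(Q, P, tau=0.05):
    """Multi-turn contrastive loss (a sketch of the extended InfoNCE).

    Q, P: arrays of shape (B, k, d) holding L2-normalized query and
    target embeddings for B images with k turns each. Flattening the
    turn axis makes the effective contrastive batch B*k.
    """
    B, k, d = Q.shape
    q = Q.reshape(B * k, d)
    p = P.reshape(B * k, d)
    logits = q @ p.T / tau                        # (B*k, B*k) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for query (i, j) is target (i, j): the diagonal.
    return -np.mean(np.diag(log_probs))
```

With matched queries and targets the loss is small; misaligning the batch (e.g. pairing each query with another image's targets) drives it up, which is the property the contrastive objective trains for.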

