MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss that strengthens the proposed personalized prompts. To advance VLM personalization research, we contribute a high-quality dataset: we carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at [https://github.com/arctanxarc/MC-LLaVA](https://github.com/arctanxarc/MC-LLaVA).


💡 Research Summary

The paper introduces MC‑LLaVA, a novel vision‑language model (VLM) designed to handle multi‑concept personalization in a single training step, addressing the limitations of prior work that focuses on single‑concept adaptation. The authors observe that real‑world applications often require a model to understand and respond to several user‑provided concepts simultaneously (e.g., multiple characters in a movie scene). Existing personalization methods either train each concept separately and merge parameters (which leads to performance degradation) or rely heavily on large numbers of high‑quality negative samples, making them costly and brittle in complex settings.

MC‑LLaVA’s core contribution is a multi‑concept instruction‑tuning framework. For a set of m concepts, the model expands its vocabulary by adding a unique identifier token ⟨sks⟩ for each concept and k learnable tokens that encode its semantic attributes. The language‑model classifier’s weight matrix is enlarged from D×N to D×(N+m), where D is the feature dimension and N the original vocab size. Training data are constructed as (image, question, answer) triples, using standard VQA formats (positive recognition, random recognition, conversation) plus a novel “joint recognition” task that creates cross‑concept negative samples automatically, yielding m·(m‑1)·n negatives without manual curation.
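The classifier expansion described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code: the initialization scale (`std=0.02`) is an assumption, and only the m identifier columns are shown; the k attribute tokens per concept would be appended analogously.

```python
import numpy as np

def expand_classifier(W, m, std=0.02, seed=0):
    """Append m new columns to the D x N classifier weight matrix,
    one per new concept identifier token <sks_j> (sketch only;
    the Gaussian init is an assumed choice)."""
    rng = np.random.default_rng(seed)
    D, N = W.shape
    new_cols = rng.normal(0.0, std, size=(D, m))
    return np.concatenate([W, new_cols], axis=1)

# Toy dimensions: feature dim D=8, original vocab N=100, m=3 concepts.
W = np.zeros((8, 100))
W2 = expand_classifier(W, m=3)
print(W2.shape)  # (8, 103), i.e. D x (N + m)
```

In a real implementation, the token-embedding table would be resized in the same way so that both input embeddings and output logits cover the enlarged vocabulary.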

To reduce the cost of learning these new tokens, the authors propose a visual‑token‑based initialization. They run Grounded‑SAM on each concept’s images to obtain foreground masks, apply these masks to the visual encoder’s feature maps, and then cluster the masked features with k‑means. The resulting centroids initialize the k concept tokens, accelerating convergence and decreasing dependence on high‑quality negatives.
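Assuming the masked foreground features have already been extracted, the clustering step could look like this minimal Lloyd's-algorithm sketch (the paper's actual k-means configuration is not specified here):

```python
import numpy as np

def init_concept_tokens(masked_feats, k, iters=10, seed=0):
    """Cluster foreground patch features (assumed precomputed from
    Grounded-SAM masks) with plain k-means; the k centroids serve as
    initial embeddings for the k concept tokens."""
    rng = np.random.default_rng(seed)
    centroids = masked_feats[rng.choice(len(masked_feats), k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest centroid.
        dists = np.linalg.norm(masked_feats[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned features.
        for j in range(k):
            pts = masked_feats[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids

# Toy example: 200 foreground patch features of dimension 16.
feats = np.random.default_rng(1).normal(size=(200, 16))
tokens = init_concept_tokens(feats, k=4)
print(tokens.shape)  # (4, 16): one initial embedding per concept token
```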

An optional auxiliary loss further grounds the tokens spatially. Attention weights from the last K transformer layers are averaged over heads and token positions to produce a soft attention mask M_attn for each concept. This mask is aligned with the binary mask from Grounded‑SAM using a differentiable IoU‑like loss L_attn. The final training objective combines the standard next‑token language modeling loss L_LM with λ·L_attn.
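The IoU-like alignment between the soft attention mask and the Grounded-SAM mask could be sketched as below; the exact formulation is an assumption based on the description, with the combined objective L = L_LM + λ·L_attn noted in the comments.

```python
import numpy as np

def soft_iou_loss(attn_map, gt_mask, eps=1e-6):
    """Differentiable IoU-style loss between a soft attention mask
    (values in [0, 1]) and a binary Grounded-SAM mask: 1 minus the
    soft intersection over soft union. Sketch; the paper's exact
    formula may differ."""
    inter = (attn_map * gt_mask).sum()
    union = (attn_map + gt_mask - attn_map * gt_mask).sum()
    return 1.0 - inter / (union + eps)

# Toy 2x2 maps: attention concentrates on the left column,
# matching the ground-truth mask fairly well.
attn = np.array([[0.9, 0.1], [0.8, 0.0]])
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = soft_iou_loss(attn, mask)
print(round(float(loss), 4))
# Total objective would then be L_LM + lambda * loss.
```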

During inference, MC‑LLaVA introduces a “personalized visual prompt.” For a test image, patch‑level visual tokens V_t are extracted. A reference similarity map M_ref is computed by averaging cosine similarities between V_t and stored support features for the concept. Simultaneously, a token‑guided map M_token is derived from the dot product between V_t and the learned concept token embedding e_j. The two maps are fused with a weight β to produce M_final, which is normalized and thresholded. If the fused map indicates the presence of a concept, a spatial indicator (“⟨sks⟩ is located at Mark j”) is appended to the system prompt, improving grounding for downstream tasks.
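The map-fusion step could be sketched as follows. The fusion weight β and the presence threshold are illustrative assumptions, and `M_ref`/`M_token` are taken as already-computed toy similarity maps.

```python
import numpy as np

def fuse_maps(M_ref, M_token, beta=0.5, thresh=0.6):
    """Fuse the reference-similarity map and the token-guided map,
    normalize to [0, 1], and threshold to decide concept presence.
    beta and thresh are assumed values, not the paper's settings."""
    M = beta * M_ref + (1.0 - beta) * M_token
    M = (M - M.min()) / (M.max() - M.min() + 1e-6)
    return M, (M > thresh)

# Toy 2x2 patch grid: the right column resembles the concept.
M_ref = np.array([[0.2, 0.9], [0.1, 0.8]])
M_token = np.array([[0.1, 0.7], [0.2, 0.9]])
M_final, present = fuse_maps(M_ref, M_token)
if present.any():
    # Analogous to appending "<sks> is located at Mark j" to the prompt.
    print("<sks> is located at patches", np.argwhere(present).tolist())
```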

To evaluate the approach, the authors build a new dataset sourced from movie scenes, containing roughly 2,000 images and 16,700 QA pairs. They use GPT‑5 to generate initial QA samples and then manually refine them, ensuring a variety of question types (hair color, presence, activity, detailed captioning) and concept counts (2‑4 per image). Compared with prior single‑concept datasets, this collection offers larger scale, richer annotations, and explicit multi‑concept scenarios.

Experimental results show that MC‑LLaVA outperforms state‑of‑the‑art single‑concept personalization methods (e.g., Yo’LLaVA) across multiple metrics: concept recognition accuracy, visual question answering, and caption generation. The auxiliary attention loss speeds up convergence by about 30 % and improves spatial awareness, while the personalized visual prompt raises location‑prediction accuracy by roughly 12 %. Importantly, the model maintains strong performance even when the proportion of high‑quality negative images is reduced, demonstrating robustness to data scarcity.

In summary, MC‑LLaVA delivers an integrated solution for multi‑concept VLM personalization by (1) jointly training multiple concepts with expanded vocabularies, (2) initializing concept tokens from visual features to cut training time, (3) optionally grounding tokens via attention‑based loss, and (4) enhancing inference with a visual prompt that supplies explicit spatial cues. The work opens avenues for scaling personalization to dozens of concepts, handling inter‑concept relationships, and combining with large‑scale pre‑training to further boost real‑world assistant capabilities.

