Title: Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
ArXiv ID: 2512.15885
Date: 2025-12-17
Authors: Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, Rita Cucchiara
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
💡 Deep Analysis
📄 Full Content
Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
Davide Caffagni1, Sara Sarto1, Marcella Cornia1, Lorenzo Baraldi1, Pier Luigi Dovesi2, Shaghayegh Roohi2, Mark Granroth-Wilding2, Rita Cucchiara1
1University of Modena and Reggio Emilia   2AMD Silo AI
1{name.surname}@unimore.it   2{name.surname}@amd.com
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
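To make the training recipe described in the abstract concrete, below is a minimal, hedged sketch of a JEPA-style masked predictive objective for visual alignment: frozen context and target encoders, a trainable projector into the LLM embedding space, and a predictor standing in for the early layers of an LLM, trained to regress target-encoder features of masked patches. Everything here (the module names context_encoder, target_encoder, projector, to_target, the toy dimensions, and the zero-masking of context tokens) is an illustrative assumption, not the authors' implementation.

```python
# Minimal PyTorch-style sketch of a JEPA-style masked predictive objective for
# MLLM visual alignment. Names and sizes are illustrative assumptions: linear
# layers stand in for the frozen vision foundation models, and a small
# transformer stands in for the early LLM layers used as predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

patch_dim, vision_dim, llm_dim, num_patches = 588, 256, 512, 64  # toy sizes

# Frozen "vision foundation models" acting as context and target encoders.
context_encoder = nn.Linear(patch_dim, vision_dim).eval().requires_grad_(False)
target_encoder = nn.Linear(patch_dim, vision_dim).eval().requires_grad_(False)

# Trainable projector into the LLM embedding space.
projector = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

# Predictor: stand-in for the early transformer layers of the LLM.
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True), num_layers=2
)
to_target = nn.Linear(llm_dim, vision_dim)  # map predictions back to the target feature space


def jepa_step(patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """patches: (B, N, patch_dim); mask: (B, N) bool, True where a patch is hidden."""
    with torch.no_grad():
        targets = target_encoder(patches)           # target features for the full image
        ctx = context_encoder(patches)              # context features
    ctx = ctx.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked patches from the context
    pred = to_target(predictor(projector(ctx)))     # predict features at every position
    # Masked predictive loss: only hidden patches contribute to the objective.
    return F.smooth_l1_loss(pred[mask], targets[mask])


# Toy usage with random data.
patches = torch.randn(2, num_patches, patch_dim)
mask = torch.rand(2, num_patches) < 0.5
loss = jepa_step(patches, mask)
loss.backward()
```

Note that in I-JEPA proper the context encoder processes only the visible patches and the predictor receives mask tokens with positional information; the zero-masking above is a simplification to keep the sketch short.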
1. Introduction
The rapid success of Large Language Models (LLMs) [10, 54, 58] has highlighted the growing need for these models to process and reason across modalities beyond text. This demand has led to the emergence of Multimodal Large Language Models (MLLMs) [11], which convert different input modalities into the same embedding space of the LLM, effectively allowing it to understand [3, 30, 64], or even generate [51], other modalities, with particular emphasis on images.

Figure 1. Comparison of LLaVA (top-left), a baseline that aligns the output of a selected layer with the output of a target encoder (top-right), and JARVIS, which aligns the outputs via a masked predictive loss (bottom-left). We also report the results of JARVIS and LLaVA across three vision benchmarks (bottom-right).

Despite impressive progress, the fundamental recipe to design an MLLM has not changed since the introduction of visual instruction tuning, originally proposed by LLaVA [15, 32, 36, 37]. LLaVA demonstrates that a lightweight projector can bridge visual and textual modalities by aligning the image representations from the vision encoder with the textual embedding space of the LLM. Through this alignment, projected visual features can be effectively interpreted by the LLM, enabling it to reason about images and generate text conditioned on visual content.
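For reference, the alignment stage described above can be pictured with a short sketch of a LLaVA-style lightweight projector: a small MLP that maps frozen vision-encoder patch features into the LLM embedding space, where they are concatenated with the text token embeddings. Dimensions and module names are illustrative assumptions rather than the released LLaVA code.

```python
# Sketch of a LLaVA-style lightweight projector (illustrative sizes, not the released code).
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096  # e.g., ViT-L/14 feature size -> LLM hidden size

# Two-layer MLP projector: the trainable bridge between modalities during alignment.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

image_features = torch.randn(1, 576, vision_dim)  # frozen vision encoder output (B, patches, dim)
text_embeds = torch.randn(1, 32, llm_dim)         # text token embeddings from the LLM

visual_tokens = projector(image_features)                    # projected into the LLM space
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # visual tokens prepended to the prompt
```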
While this pipeline has proven highly effective across a broad range of tasks, current MLLMs still exhibit notable limitations in surprisingly simple visual reasoning scenarios, such as confirming the presence of objects, counting them, understanding their spatial relationships, or estimating their relative distance [18, 56, 57, 63]. The low proficiency of current MLLMs in these visual tasks highlights a severe deficit in their visual perception. We believe that this flaw emerges because MLLMs are trained to see images only via their textual descriptions. Indeed, during the alignment stage proposed by LLaVA, the MLLM is presented with an image, and the learning objective is to generate its caption. Intuitively, if the MLLM can describe an image, then it should have seen it. However, image captions are inherently subjective [13, 48]: they reflect what annotators deem relevant, often omitting details that may be crucial from other perspectives. Moreover, it is not practically feasible to assume access to all possible descriptions of an image. Consequently, an image intrinsically contains richer and more comprehensive information than any subset of its textual descriptions. At the same time, because multimodal training is relatively modest compared to the massive unsupervised pre-training on textual corpora, MLLMs often over-rely on language priors when reasoning about an image, thereby overlooking visual details [8, 16, 61, 67].

With that in mind, in this work we advocate for training MLLMs with the self-supervised signal inherent