DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall’s tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.


💡 Research Summary

This paper tackles a fundamental yet under‑explored problem in the development of multimodal large language models (MLLMs): how to predict the influence of a candidate supervision dataset on a target benchmark before any training is performed. The authors begin by questioning the prevailing intuition that datasets “similar” to a target task—e.g., text‑rich versus vision‑centric—will reliably yield larger performance gains. To investigate this empirically, they conduct a large‑scale transfer study using the InternVL3‑2B model as a fixed backbone. Fourteen publicly available vision‑language datasets are selected, covering seven distinct task families (OCR, chart understanding, document understanding, general VQA, spatial reasoning, counting, and map reasoning), with two datasets per family. For each dataset, they fine‑tune the model on a uniform 20k‑sample training split and evaluate on a uniform 1k test split of every other dataset, measuring the relative improvement Δ = (Aₛₜ − Aₜ)/Aₜ, where Aₛₜ is the accuracy after fine‑tuning on source s and evaluating on target t, and Aₜ is the base model’s accuracy on target t.
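The bookkeeping behind this transfer study can be sketched as follows. The dataset names and accuracy values below are hypothetical, chosen only to illustrate the Δ formula, not taken from the paper:

```python
# Sketch of the relative-improvement metric Δ = (A_st - A_t) / A_t.
# All dataset names and accuracies here are hypothetical.

def relative_improvement(acc_after: float, acc_base: float) -> float:
    """Relative gain on target t after fine-tuning on source s,
    compared to the base model's accuracy on t."""
    return (acc_after - acc_base) / acc_base

# Hypothetical accuracies for one target benchmark.
base_acc = 0.50                   # A_t: base model on target t
after_ft = {"ocr_data": 0.56,     # A_st after fine-tuning on each source
            "chart_data": 0.53,
            "vqa_data": 0.49}

delta = {src: relative_improvement(a, base_acc) for src, a in after_ft.items()}
# delta["ocr_data"] is 0.12, i.e. a 12% relative improvement;
# delta["vqa_data"] is negative, i.e. that source hurts this target.
```

Repeating this for every source/target pair yields the 14×14 transfer matrix the rest of the analysis builds on.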

The empirical results reveal three surprising findings. First, intuitive task similarity is a poor predictor of transferability: OCR data improve spatial reasoning benchmarks more than chart‑understanding benchmarks, contradicting the expectation that OCR‑related tasks would benefit each other most. Second, data influence is asymmetric; the gain from training on dataset A and testing on B is not equal to the gain from the reverse direction. Third, the decisive factor is not the broad task category but the specific characteristics of each dataset; datasets within the same family often have divergent effects on the same target. These observations collectively demonstrate that simple heuristics based on task taxonomy are insufficient for data selection in MLLMs.

Motivated by these findings, the authors propose DATAPROPHET, a training‑free metric designed to predict the influence of any source‑target pair. DATAPROPHET aggregates four components: (1) multimodal perplexity, computed as the negative log‑likelihood of a pretrained multimodal language model on each sample, capturing how “easy” the data is for the model; (2) textual similarity, measured by cosine similarity between BERT embeddings of source and target captions; (3) visual similarity, measured by cosine similarity between CLIP image embeddings; and (4) data diversity, quantified via clustering‑based entropy or average intra‑cluster distance to assess the variety within a dataset. The final score is a weighted linear combination of these components, with weights tuned on a small validation set.
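A minimal sketch of how the four components might be combined. The weights, the lower-perplexity-is-better sign convention, and the helper names are illustrative assumptions (the paper tunes its weights on a small validation set), and the entropy-based diversity term is just one of the clustering-based instantiations mentioned above:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors (e.g. BERT or CLIP)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_entropy(labels):
    """Diversity as the entropy of cluster assignments: the more evenly a
    dataset spreads over clusters, the higher the score."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def dataprophet_score(perplexity, txt_src, txt_tgt, img_src, img_tgt,
                      cluster_labels, w=(0.4, 0.2, 0.25, 0.15)):
    """Weighted linear combination of the four components; the weights
    here are placeholders, not the paper's tuned values."""
    components = np.array([
        -perplexity,                      # easier data scores higher
        cosine(txt_src, txt_tgt),         # textual similarity
        cosine(img_src, img_tgt),         # visual similarity
        cluster_entropy(cluster_labels),  # data diversity
    ])
    return float(np.dot(np.asarray(w), components))
```

In practice the embeddings and per-sample perplexities would be averaged over each dataset before scoring, so ranking candidates requires only forward passes through frozen encoders, never fine-tuning.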

To evaluate DATAPROPHET, the authors adopt a two‑way ranking protocol. For each target benchmark, they rank all 14 source datasets by DATAPROPHET score and compare this ranking to the ground‑truth ranking derived from the actual Δ values, using Kendall’s τ (τ_Tgt). Symmetrically, for each source dataset they rank all targets (τ_Src). The overall correlation metric is the average of τ_Tgt and τ_Src, on which DATAPROPHET achieves a Kendall’s τ of 86.0% with the true rankings, indicating strong predictive power. Ablation studies show that multimodal perplexity contributes the most (a 37.3% increase in τ), followed by visual similarity (23.5%).
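The ranking comparison can be illustrated with a bare-bones Kendall's τ over one target's source ranking (no tie handling; the four scores below are hypothetical):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Pairwise Kendall's tau between two score lists over the same items:
    (concordant pairs - discordant pairs) / total pairs. Minimal sketch,
    ignoring ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for four source datasets on one target benchmark:
predicted = [0.9, 0.7, 0.4, 0.1]   # DATAPROPHET scores
actual    = [0.8, 0.6, 0.5, 0.2]   # ground-truth Delta values
tau_tgt = kendall_tau(predicted, actual)   # 1.0: identical ordering
```

The paper's overall metric then averages this τ_Tgt (computed per target over sources) with the symmetric τ_Src (computed per source over targets).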

Beyond correlation, the authors demonstrate the practical utility of DATAPROPHET for data selection under a fixed compute budget. By selecting the top‑ranked datasets for each target according to DATAPROPHET, they obtain average performance improvements of 3.4 % over uniform random selection when using real data, and 6.9 % when using synthetic data. Compared to ICONS, a state‑of‑the‑art training‑based data selection method, DATAPROPHET outperforms by 1.4 % (real) and 1.2 % (synthetic). Remarkably, the method even surpasses an “oracle” that selects datasets based on actual post‑training performance by 0.2 %, highlighting the efficacy of a training‑free approach.
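Under a fixed compute budget, this selection step reduces to picking the top-k sources by predicted score for each target. A minimal sketch, with hypothetical dataset names and scores:

```python
def select_top_k(scores: dict, k: int) -> list:
    """Pick the k highest-scoring source datasets for a target under a
    fixed training budget (names and scores below are hypothetical)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

scores = {"ocr_data": 0.91, "chart_data": 0.55,
          "vqa_data": 0.73, "map_data": 0.40}
chosen = select_top_k(scores, k=2)   # ['ocr_data', 'vqa_data']
```

Because the scores are computed without any training runs, this replaces the per-candidate fine-tuning sweep that training-based methods such as ICONS require.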

The paper’s contributions are threefold: (1) a systematic, controlled empirical analysis that disproves the conventional wisdom of task‑similarity‑driven data selection; (2) the introduction of a simple yet powerful training‑free metric that combines perplexity, multimodal similarity, and diversity to predict data influence; and (3) a demonstration that this metric can guide supervision‑data selection to achieve state‑of‑the‑art performance gains without any model training.

In summary, “DataProphet” provides both a diagnostic lens for understanding why certain multimodal datasets transfer better than others and a practical tool for efficiently curating supervision data in resource‑constrained settings. The work opens avenues for extending training‑free influence prediction to larger model families, richer multimodal modalities (e.g., audio‑visual), and automated pipeline integration, potentially reshaping how the community approaches data curation for next‑generation MLLMs.

