LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises four primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 22M instruction dataset, LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. (4) RL-based Post-training: We unlock the model’s latent potential through a lightweight RL stage, effectively eliciting robust chain-of-thought reasoning to significantly boost performance on complex multimodal reasoning tasks.


💡 Research Summary

The paper introduces LLaVA-OneVision-1.5, a comprehensive and fully open-source framework designed to democratize the development of high-performance Large Multimodal Models (LMMs). It addresses the critical barriers in the field—proprietary models, high computational costs, and lack of reproducibility—by releasing all components necessary to build state-of-the-art vision-language models from scratch.

The core contributions are fourfold. First, the authors construct and release two large-scale, high-quality datasets: LLaVA-OneVision-1.5-Mid-Training, an 85 million image-text pair dataset with a novel concept-balanced sampling strategy that ensures semantic diversity without relying on caption quality, and LLaVA-OneVision-1.5-Instruct, a meticulously curated 22 million sample instruction-tuning dataset covering seven broad categories like captioning, science, and OCR.
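The concept-balanced idea can be sketched as inverse-frequency weighted sampling over concept labels. The function name and weighting scheme below are illustrative assumptions; the paper's actual pipeline assigns concepts from visual semantics rather than a simple label count.

```python
import random
from collections import Counter

def concept_balanced_sample(samples, concepts, k, seed=0):
    """Draw k samples with probability inversely proportional to the
    frequency of each sample's concept, flattening the concept mix."""
    counts = Counter(concepts)
    weights = [1.0 / counts[c] for c in concepts]
    return random.Random(seed).choices(samples, weights=weights, k=k)

# Toy corpus: 90 "dog" images vs. 10 "cat" images.
samples = list(range(100))
concepts = ["dog"] * 90 + ["cat"] * 10
picked = concept_balanced_sample(samples, concepts, k=1000)
cat_share = sum(1 for s in picked if s >= 90) / len(picked)
# cat_share lands near 0.5 instead of the raw 0.1 concept frequency
```

The point of the scheme is that balance comes from the concept distribution itself, so it does not depend on caption quality.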

Second, they develop an efficient end-to-end training framework. A key innovation is the offline parallel data packing strategy, which pre-packs multiple shorter multimodal samples into single sequences during preprocessing. This drastically reduces the computational overhead from padding tokens, a major source of inefficiency when handling heterogeneous data. This optimization enables training the 8B parameter model within a budget of approximately $16,000.
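The packing step can be sketched as greedy first-fit-decreasing bin packing of sample lengths into fixed-size sequences; the heuristic and names here are illustrative assumptions, not the release's exact algorithm.

```python
def pack_samples(lengths, max_len):
    """Greedily pack variable-length sample lengths into sequences
    of capacity max_len (first-fit decreasing)."""
    bins = []  # each bin is a list of sample lengths summing to <= max_len
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

lengths = [900, 300, 700, 200, 512, 100, 1000, 60]
bins = pack_samples(lengths, max_len=1024)

# Padding fraction with one sample per sequence vs. packed sequences:
pad_unpacked = sum(1024 - l for l in lengths) / (1024 * len(lengths))
pad_packed = sum(1024 - sum(b) for b in bins) / (1024 * len(bins))
```

Because packing is done offline during preprocessing, the savings cost nothing at training time; the trainer only needs block-diagonal attention masking so packed samples do not attend to each other.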

Third, the resulting models achieve state-of-the-art performance. Extensive evaluations across 27 benchmarks show that LLaVA-OneVision-1.5-8B outperforms the strong baseline Qwen2.5-VL-7B on 18 tasks, spanning general VQA (e.g., MMBench), complex reasoning (e.g., MathVista, MMMU), and OCR/Chart understanding (e.g., DocVQA). Remarkably, the smaller 4B model surpasses Qwen2.5-VL-3B on all 27 benchmarks, demonstrating exceptional efficiency.

Fourth, the authors introduce a lightweight RL-based post-training stage using the asynchronous AReal system. By employing a discrepancy-driven data selection strategy and rigorous outcome verification, this stage effectively elicits the model’s latent chain-of-thought reasoning capabilities. The RL-enhanced model shows significant performance boosts, particularly on demanding multimodal reasoning tasks.
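A minimal reading of discrepancy-driven selection plus outcome verification might look like the following; the field names, thresholds, and exact-match verifier are assumptions for illustration, not the AReal system's API.

```python
def outcome_reward(prediction, reference):
    """Binary outcome verification: 1.0 iff the final answer matches."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def select_by_discrepancy(records, low=0.2, high=0.8):
    """Keep prompts whose sampled pass rate is neither ~0 nor ~1;
    these show the most outcome disagreement and carry the richest
    RL training signal."""
    return [r for r in records if low <= r["pass_rate"] <= high]

records = [
    {"prompt": "q1", "pass_rate": 0.0},  # always wrong: no useful gradient
    {"prompt": "q2", "pass_rate": 0.5},  # maximally informative
    {"prompt": "q3", "pass_rate": 1.0},  # already solved
]
kept = select_by_discrepancy(records)
```

Filtering out always-solved and never-solved prompts keeps the lightweight RL stage focused on questions where chain-of-thought actually changes the outcome.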

Architecturally, LLaVA-OneVision-1.5 adopts a “ViT-MLP-LLM” structure. It leverages the RICE-ViT vision encoder, which utilizes a unified region cluster discrimination loss for superior region-aware and OCR capabilities, and uses Qwen3 as the language model backbone. The training follows a three-stage pipeline: language-image alignment, high-quality knowledge learning (full-parameter training on the mid-training dataset), and visual instruction tuning.
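The "ViT-MLP-LLM" glue can be sketched as a two-layer projector mapping each patch feature from the vision encoder into the LLM's embedding width. The tiny dimensions and ReLU below are toy assumptions; real projectors typically use GELU and the full encoder/backbone widths.

```python
def matvec(w, x):
    """Multiply matrix w (rows = output dims) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mlp_projector(patch_feat, w1, w2):
    """Project one ViT patch feature into the LLM embedding space."""
    hidden = [max(0.0, h) for h in matvec(w1, patch_feat)]  # ReLU
    return matvec(w2, hidden)

# Toy dims: ViT feature dim 3 -> hidden 5 -> LLM embedding dim 4.
w1 = [[0.1] * 3 for _ in range(5)]
w2 = [[0.2] * 5 for _ in range(4)]
patch = [1.0, 2.0, 3.0]
llm_token = mlp_projector(patch, w1, w2)  # one LLM-dim visual token
```

Each projected patch becomes a soft token prepended to the text sequence, which is why all three stages can train the same end-to-end stack with only the trainable-parameter set changing.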

In summary, LLaVA-OneVision-1.5 provides a complete, open, and cost-effective blueprint for building competitive LMMs. By publicly releasing all datasets, training code, infrastructure optimizations, and model weights, it significantly lowers the barrier to entry for cutting-edge multimodal AI research and development.

