Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA


Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs such as GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation details, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with capabilities across different tasks, we develop three variants based on Moxin, namely Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt open-source frameworks and open data for training, and we release our models along with the data and code used to derive them.


💡 Research Summary

The paper presents a comprehensive open‑source multimodal family built around the newly released Moxin‑7B large language model (LLM). Guided by the Model Openness Framework (MOF), the authors make the entire research pipeline—model weights, training code, datasets, and tokenizer—publicly available, addressing the “open‑washing” problem that plagues many so‑called open models. Moxin‑7B, a 7‑billion‑parameter LLM trained on roughly 1.2 trillion English tokens, serves as the backbone for three specialized variants: Moxin‑VLM (vision‑language), Moxin‑VLA (vision‑language‑action), and Moxin‑Chinese (Chinese‑language enhancement).

Moxin‑VLM adopts the Prismatic VLM framework and fuses two state‑of‑the‑art visual backbones, DINOv2 (low‑level spatial features) and SigLIP (high‑level semantic features learned from diverse internet images). The visual modules are frozen; a projection layer maps visual embeddings into the LLM’s token space, and both the projection and the LLM are jointly trained on a curated multimodal dataset. This dataset combines the LLaVA v1.5 mixture (558 K image‑text pairs) with 665 K multimodal instruction examples spanning synthetic data, VQA, captioning, and referring‑expression tasks, plus a small share of language‑only ShareGPT conversations. Training proceeds for two epochs in a single‑stage fashion. Empirical results show that Moxin‑VLM outperforms comparable LLaMA‑7B‑VLM and Mistral‑7B‑VLM baselines by 2–3 percentage points in average accuracy across standard vision‑language benchmarks.
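The fuse-then-project design above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, dimensions, and the channel-wise concatenation scheme are assumptions based on the Prismatic-style description (per-patch features from the two frozen encoders are concatenated, then a trainable linear projection maps them into the LLM's embedding space).

```python
# Hypothetical sketch of Moxin-VLM's visual-feature fusion and projection.
# All names and dimensions are illustrative, not taken from the paper.

def fuse_patch_features(dino_feats, siglip_feats):
    """Concatenate per-patch features from the two frozen vision encoders
    (DINOv2 low-level spatial + SigLIP high-level semantic) channel-wise."""
    assert len(dino_feats) == len(siglip_feats)  # same patch grid
    return [d + s for d, s in zip(dino_feats, siglip_feats)]

def project(fused, weight):
    """Trainable linear projection mapping each fused patch vector into the
    LLM token-embedding space (bias omitted for brevity).
    `weight` has shape [llm_dim][fused_dim]."""
    return [[sum(w * x for w, x in zip(row, patch)) for row in weight]
            for patch in fused]
```

In the actual system only this projection and the LLM are updated during training, while both vision encoders stay frozen.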

Moxin‑VLA extends the VLM into robotic control. Using the OpenVLA‑OFT (Optimized Fine‑Tuning) recipe, the authors attach an action head that predicts “chunks” of future actions in parallel rather than step‑by‑step, dramatically reducing inference latency and improving temporal coherence. Two training strategies are explored: (1) a large‑scale generalist pre‑training on the Open‑X Embodiment dataset (over 1 M trajectories from 22 robot embodiments) followed by task‑specific fine‑tuning, and (2) direct fine‑tuning from the Moxin‑VLM checkpoint without the generalist stage. Both variants are trained on a single node with 8 × H100 GPUs for roughly 90 k steps (≈2 weeks). The direct‑fine‑tuning route achieves comparable performance to the more expensive pre‑training pipeline, demonstrating that the semantic priors learned by the VLM are sufficient for rapid policy acquisition. Across a suite of manipulation, dexterous, and long‑horizon tasks, Moxin‑VLA surpasses existing OpenVLA and VILA models.
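The chunked-action decoding described above can be illustrated with a small sketch. This is an assumption-laden simplification of the OpenVLA-OFT recipe: the policy head emits one flat vector per forward pass, which is then split into a chunk of K consecutive actions instead of being decoded autoregressively one timestep at a time.

```python
# Hypothetical sketch of parallel action-chunk decoding (OpenVLA-OFT style).
# Chunk length, action dimensionality, and the flat-output layout are
# illustrative assumptions, not details from the paper.

def decode_action_chunk(flat_output, chunk_len, action_dim):
    """Split a single forward-pass output into `chunk_len` consecutive
    actions of `action_dim` dimensions each. One model call thus yields
    K future actions, cutting inference latency versus step-by-step
    decoding and keeping the chunk temporally coherent."""
    assert len(flat_output) == chunk_len * action_dim
    return [flat_output[i * action_dim:(i + 1) * action_dim]
            for i in range(chunk_len)]
```

The robot then executes the chunk (or part of it) before querying the policy again, which is where the latency savings come from.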

Moxin‑Chinese addresses the limited Chinese coverage of the original Moxin tokenizer. The authors expand the vocabulary with ~57 k Chinese BPE tokens derived from WuDaoCorpus2, WanJuan, and other high‑quality corpora, then continue pre‑training on a mixture of Chinese books, news, and distilled data. This yields noticeable gains on Chinese‑English translation (BLEU +4–5), Chinese question answering, and summarization benchmarks, placing Moxin‑Chinese ahead of other open‑source Chinese‑enhanced LLMs.
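The vocabulary-expansion step can be sketched as follows. This is a generic illustration of how new BPE tokens are typically merged into an existing tokenizer before continued pre-training; the function names, the Gaussian initialisation, and the example tokens are assumptions, not details from the paper.

```python
# Hypothetical sketch of tokenizer expansion for Moxin-Chinese: new Chinese
# BPE tokens are appended to the vocabulary and the embedding table grows by
# matching rows. Init scheme and names are illustrative assumptions.
import random

def expand_vocab(vocab, new_tokens):
    """Append unseen tokens to the token->id map; existing ids stay stable
    so previously learned embeddings remain valid."""
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def grow_embeddings(table, vocab_size, dim, rng=None):
    """Add freshly initialised rows for the new ids; old rows are untouched.
    The new rows are then trained during continued pre-training."""
    rng = rng or random.Random(0)
    while len(table) < vocab_size:
        table.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return table
```

Continued pre-training on Chinese corpora then adapts both the new embedding rows and the rest of the model.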

All code, model checkpoints, and data scripts are released on GitHub and HuggingFace, enabling immediate replication and further community development. The paper concludes that a fully transparent, reproducible multimodal ecosystem is feasible and beneficial, and it outlines future directions such as scaling model size, integrating audio/video modalities, and refining efficient fine‑tuning techniques.

