Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs
Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Mixture of Vision Encoders (MoVE) has emerged as a powerful approach to enhance the fine-grained visual understanding of multimodal large language models (MLLMs), improving their ability to handle tasks such as complex optical character recognition and scene understanding. Despite these advances, effectively combining diverse encoders and their visual tokens, while also scaling to high-resolution inputs, remains an open challenge. In this work, we conduct a systematic study of fusion designs for MoVE-based MLLMs, highlighting principles for token-level integration across complementary encoders. Our study shows that a lightweight recipe consisting of post-adaptation fusion with independent projectors, tile-level sequence interleaving, and dynamic tiling with global context delivers strong performance on diverse benchmarks. We integrate these principles into a simple and effective architecture that we call LEO. Extensive evaluation on 11 vision-language benchmarks demonstrates that LEO achieves better results on the majority of tasks compared to existing MoVE-based approaches. Furthermore, LEO adapts effectively to the specialized domain of autonomous driving without altering its architecture or training recipe, achieving competitive performance against established baselines and thereby highlighting its ability to generalize. The code is available at https://github.com/Mozhgan91/LEO.


💡 Research Summary

The paper revisits the Mixture of Vision Encoders (MoVE) paradigm for multimodal large language models (MLLMs) and provides a systematic empirical study of how to fuse multiple pretrained vision experts most effectively. While prior work has explored either high‑resolution tiling or the combination of several vision encoders, the interaction between these two strategies has not been thoroughly examined. The authors therefore define three investigative dimensions: (D1) the interaction of visual‑reasoning enhancement techniques (e.g., tiling) with MoVE, (D2) token‑level merging strategies, and (D3) the timing of fusion relative to the alignment of visual tokens to the language model space.

Dynamic Tiling with Global Context (D1).
The authors propose a “dynamic tiling” scheme that adapts the number and arrangement of tiles to the aspect ratio of each input image while keeping tile size fixed (e.g., 448 × 448). In addition to the set of tiles, a low‑resolution thumbnail of the whole image is generated to provide global context. This combination allows the model to preserve fine‑grained details without exceeding the LLM’s context window, and it consistently outperforms fixed‑grid and overlapping tiling across a range of benchmarks (average gains of 0.4–0.7 percentage points).
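The grid-selection step above can be sketched in a few lines. The search over candidate grids below is an illustrative assumption (the paper's exact procedure may differ): it picks the tile layout whose aspect ratio best matches the image, keeps the tile size fixed, and always appends one thumbnail box for global context.

```python
def select_tile_grid(width, height, max_tiles=12):
    """Choose a (cols, rows) grid whose aspect ratio best matches the
    input image while keeping tile size fixed. The brute-force search
    and the max_tiles cap are illustrative assumptions."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # stay within the LLM's context budget
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

def tile_plan(width, height, tile=448, max_tiles=12):
    """Return fixed-size tile boxes (x, y, w, h) plus a low-resolution
    thumbnail of the whole image for global context."""
    cols, rows = select_tile_grid(width, height, max_tiles)
    tiles = [(c * tile, r * tile, tile, tile)
             for r in range(rows) for c in range(cols)]
    thumbnail = (0, 0, tile, tile)  # whole image resized to one tile
    return tiles, thumbnail
```

For a wide 1344 × 448 image this yields a 3 × 1 grid of 448 × 448 tiles plus the thumbnail, whereas a square input collapses to a single tile.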

Token Merging Strategies (D2).
Four common fusion blocks are compared: simple sequence concatenation, sequence interleaving, channel concatenation, and cross‑attention. Contrary to the intuition that sophisticated cross‑attention would dominate, the experiments reveal that tile‑level sequence interleaving is the most robust. By alternating tokens from each encoder at the tile level, the method respects the LLM’s autoregressive ordering while balancing the contributions of each vision expert. This design yields higher accuracy and lower computational overhead than cross‑attention or naïve concatenation.
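Tile-level sequence interleaving is simple enough to sketch directly. The helper below (a minimal sketch; the token representations are placeholders) alternates whole per-tile token groups from the two encoders, so the merged sequence preserves the autoregressive tile order:

```python
def interleave_tile_tokens(tokens_a, tokens_b):
    """Tile-level sequence interleaving: alternate whole per-tile token
    groups from two encoders. tokens_a / tokens_b are lists with one
    token list per tile, in the same tile order."""
    assert len(tokens_a) == len(tokens_b), "both encoders see the same tiles"
    merged = []
    for tile_a, tile_b in zip(tokens_a, tokens_b):
        merged.extend(tile_a)  # tokens for this tile from encoder A
        merged.extend(tile_b)  # then the same tile from encoder B
    return merged
```

Note that interleaving happens per tile, not per token, so each encoder's local spatial structure stays contiguous within the merged sequence.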

Fusion Timing and Independent Projectors (D3).
The study contrasts pre‑adaptation fusion (merging before alignment) with post‑adaptation fusion (merging after each encoder’s output has been projected into the LLM token space). Post‑adaptation fusion with independent linear projectors for each encoder preserves encoder‑specific features and consistently improves performance by 1.2–2.5 percentage points. The independent projectors are lightweight (single‑layer linear maps) and are learned jointly with the rest of the model.
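The distinction can be made concrete with a small sketch. Below, each encoder's features pass through its own single-layer linear projector into the shared LLM token space before merging (merging is shown as concatenation for brevity; LEO interleaves per tile). The plain-list linear algebra is purely illustrative:

```python
def linear(tokens, weight):
    """Single-layer linear projection (bias omitted for brevity).
    tokens: list of feature vectors; weight: list of output rows,
    each of input dimension."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight]
            for vec in tokens]

def post_adaptation_fuse(feats_a, feats_b, proj_a, proj_b):
    """Post-adaptation fusion: project each encoder's features with its
    OWN projector first, then merge in the shared LLM token space."""
    return linear(feats_a, proj_a) + linear(feats_b, proj_b)
```

Pre-adaptation fusion would instead merge `feats_a` and `feats_b` first and push the result through a single shared projector, which is exactly what the study finds erodes encoder-specific features.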

LEO Architecture.
Guided by the three findings, the authors introduce LEO, a lightweight MoVE‑based MLLM that integrates:

  1. Dynamic tiling + global thumbnail,
  2. Two pretrained vision encoders (e.g., CLIP‑ViT and DINOv2), each followed by its own projector,
  3. Tile‑level sequence interleaving of the projected tokens,
  4. A frozen or lightly‑fine‑tuned LLM (LLaVA‑13B in the experiments).

The overall pipeline is: Image → Dynamic tiling → Independent encoder processing → Independent projection → Interleaving → LLM. LEO adds only ~10 % more parameters than a single‑encoder baseline and reduces inference latency by 15–20 % because it avoids heavy cross‑attention modules.
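The pipeline above can be expressed as a short orchestration function. Every callable here is a hypothetical stand-in for the real module (encoders, projectors, LLM), so this is a structural sketch rather than the authors' implementation:

```python
def leo_forward(image, tiler, thumbnail,
                enc_a, enc_b, proj_a, proj_b,
                llm, prompt_tokens):
    """Structural sketch of the LEO pipeline:
    dynamic tiling + global thumbnail -> independent encoding ->
    independent projection -> tile-level interleaving -> LLM.
    All arguments are hypothetical stand-ins for the real modules."""
    tiles = tiler(image) + [thumbnail(image)]  # tiles plus global context
    merged = []
    for tile in tiles:
        merged += proj_a(enc_a(tile))  # encoder A tokens for this tile
        merged += proj_b(enc_b(tile))  # then encoder B tokens, interleaved
    return llm(prompt_tokens + merged)
```

Because the two encoder/projector branches never attend to each other, no cross-attention parameters are introduced, which is where the latency savings come from.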

Experimental Evaluation.
LEO is evaluated on eleven vision‑language benchmarks covering VQA (VQAv2, GQA, VizWiz), multimodal reasoning (MMBench, POPE, SEED), and text‑image matching (MMVet). Across the board, LEO matches or surpasses existing MoVE‑based models such as LLaVA‑HR, Mixture‑of‑Resolution, and MoE‑Vision, achieving average improvements of 2–3 percentage points on fine‑grained OCR and complex scene understanding tasks. An ablation study confirms that removing any of the three core components (dynamic tiling, independent projectors, interleaving) leads to measurable drops in performance.

Domain Generalization to Autonomous Driving.
To test generalization, the same LEO architecture and training recipe are applied to autonomous‑driving datasets (BDD100K, nuScenes) without any architectural changes or domain‑specific fine‑tuning. LEO attains competitive results against specialized driving‑vision‑language baselines, demonstrating that the fusion principles are robust to domain shift.

Implementation Details.

  • Vision encoders: CLIP‑ViT‑B/32 and DINOv2‑ViT‑L/14, frozen during the first alignment stage.
  • LLM: LLaVA‑13B, fine‑tuned with multimodal instruction data (≈400 M image‑text pairs).
  • Training: Two‑stage process – (i) visual‑language alignment using contrastive loss, (ii) instruction‑following fine‑tuning with cross‑entropy loss. Optimizer: AdamW, cosine learning‑rate decay, batch size 256.
  • Ablations: Tested alternative tiling (no‑tiling, fixed‑grid, overlapping), fusion timing, and merging strategies, reporting detailed per‑task metrics.
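The cosine learning-rate decay in the training setup is standard; a minimal sketch follows (warmup length and `min_lr` are illustrative assumptions, not values from the paper):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0, warmup=0):
    """Cosine learning-rate decay with optional linear warmup.
    Decays from base_lr at step 0 (after warmup) to min_lr at
    total_steps; warmup and min_lr defaults are assumptions."""
    if step < warmup:
        return base_lr * (step + 1) / warmup  # linear warmup ramp
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With `base_lr=1.0` over 100 steps the schedule starts at 1.0, passes 0.5 at the midpoint, and reaches 0 at the end.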

Conclusions and Impact.
The paper demonstrates that effective MoVE design does not require elaborate routing or attention mechanisms; instead, a carefully chosen combination of dynamic tiling, independent projection, and simple interleaving yields a highly efficient and performant multimodal model. LEO’s simplicity facilitates easy extension to additional vision experts, higher‑resolution inputs, or other modalities such as video. The work provides concrete, reproducible guidelines for researchers aiming to build scalable, high‑resolution multimodal systems, and sets a new baseline for MoVE‑based MLLMs.
