MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

LLM-based ASR overcomes multilingual data scarcity by projecting speech representations into the LLM space to leverage its robust semantic and reasoning capabilities. However, while previous approaches typically enhance performance by scaling data or model parameters, a single projector often struggles to effectively align representations across different languages. In this work, we propose an MoE-based projector named MOSA (Mixture of Simple Adapters). By aggregating multiple simple adapters, this architecture enables different experts to specialize in learning either language-shared or language-specific knowledge. This approach not only mitigates parameter interference between languages but also facilitates positive transfer from high-resource to low-resource languages, effectively alleviating data scarcity issues. Experimental results demonstrate that MOSA-Base achieves a 15.4% relative reduction in average WER compared to the Ideal-LLM Base, consistently outperforming it across all languages. Notably, MOSA achieves a 13.3% WER reduction over the Ideal-LLM Base while utilizing only 60% of its parameters. These findings highlight MOSA’s superior parameter efficiency and robustness against data imbalance, suggesting that a mixture of simple adapters is more suitable for multilingual LLM-based ASR than complex single-adapter designs.


💡 Research Summary

The paper addresses a critical bottleneck in multilingual automatic speech recognition (ASR) that leverages large language models (LLMs): the projector that aligns speech encoder outputs with the LLM’s input space. Prior work has largely focused on scaling data or model size, often using a single linear or transformer‑based projector. Such monolithic designs struggle to capture both language‑shared acoustic patterns and language‑specific phonetic or lexical nuances, especially under severe data imbalance across languages.

To overcome this, the authors propose MOSA (Mixture of Simple Adapters), a Mixture‑of‑Experts (MoE) projector that aggregates several lightweight adapters. The overall architecture consists of a frozen Whisper‑large‑v3 speech encoder, a frozen Phi‑3‑mini‑4k‑instruct LLM, and the MOSA projector in between. Each adapter is a two‑layer linear network with ReLU activation, dramatically simpler than the transformer‑based adapters used in earlier studies. A router network pools the encoder’s hidden representation over time, applies a softmax, and produces a weight vector that dynamically mixes the outputs of the adapters for each utterance. This design allows some adapters to specialize in language‑agnostic features (e.g., universal phoneme representations) while others capture language‑specific characteristics (e.g., tonal patterns, orthographic conventions).
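The routing-and-mixing mechanism described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the hidden sizes (1280 for the Whisper encoder, 3072 for Phi-3-mini, a 2048-unit adapter bottleneck) and the mean-pooling choice are assumptions for the sketch, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only.
D_ENC, D_LLM, D_HID, N_EXPERTS = 1280, 3072, 2048, 4

def relu(x):
    return np.maximum(x, 0.0)

class SimpleAdapter:
    """Two-layer linear network with ReLU, per the paper's description."""
    def __init__(self):
        self.w1 = rng.standard_normal((D_ENC, D_HID)) * 0.02
        self.w2 = rng.standard_normal((D_HID, D_LLM)) * 0.02

    def __call__(self, h):                       # h: (T, D_ENC)
        return relu(h @ self.w1) @ self.w2       # -> (T, D_LLM)

class MOSAProjector:
    """Mixture of simple adapters with an utterance-level softmax router."""
    def __init__(self):
        self.adapters = [SimpleAdapter() for _ in range(N_EXPERTS)]
        self.w_router = rng.standard_normal((D_ENC, N_EXPERTS)) * 0.02

    def __call__(self, h):                       # h: (T, D_ENC) encoder states
        pooled = h.mean(axis=0)                  # pool over time (assumed mean pooling)
        logits = pooled @ self.w_router
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()                 # softmax -> (N_EXPERTS,)
        outs = np.stack([a(h) for a in self.adapters])   # (N_EXPERTS, T, D_LLM)
        return np.tensordot(weights, outs, axes=1)       # weighted mix -> (T, D_LLM)

proj = MOSAProjector()
speech = rng.standard_normal((50, D_ENC))        # 50 encoder frames
llm_inputs = proj(speech)
print(llm_inputs.shape)                          # (50, 3072)
```

Because the router produces one weight vector per utterance, every frame of an utterance is projected by the same mixture, which is what lets individual adapters specialize by language.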

The authors evaluate MOSA on the Multilingual LibriSpeech (MLS) benchmark, covering eight languages (EN, DE, NL, FR, ES, IT, PT, PL) with a highly skewed data distribution: English provides over 44,000 hours, while the remaining languages range from roughly 100 to 2,000 hours. Two model scales are trained: MOSA‑Base (4 adapters, 0.155 B trainable parameters) and MOSA‑Large (8 adapters, 0.287 B parameters). Both are compared against Ideal‑LLM Base/Large (a dual‑encoder baseline with language‑specific weighting) and a LLaMA‑with‑ASR baseline.

Key results:

  • MOSA‑Base achieves an average word error rate (WER) of 7.66 %, a 15.4 % relative reduction over Ideal‑LLM Base (9.05 %).
  • MOSA‑Large further lowers average WER to 7.50 % and shows pronounced gains on low‑resource languages such as Portuguese.
  • MOSA‑Base uses only 60 % of the trainable parameters of Ideal‑LLM Base while delivering superior performance, demonstrating strong parameter efficiency.
  • Ablation studies varying the number of adapters (2–5) reveal that four adapters strike the best balance; performance degrades with five adapters due to over‑parameterization relative to the limited training data.
  • A single‑adapter configuration without a router consistently underperforms, confirming the necessity of both multiple experts and dynamic routing, especially for low‑resource languages.
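As a quick sanity check on the headline numbers, the relative WER reduction is (baseline − model) / baseline, which for the reported values of 9.05 % and 7.66 % reproduces the quoted 15.4 %:

```python
def rel_reduction(baseline_wer, model_wer):
    """Relative WER reduction: (baseline - model) / baseline."""
    return (baseline_wer - model_wer) / baseline_wer

# MOSA-Base (7.66 % WER) vs. Ideal-LLM Base (9.05 % WER)
reduction = rel_reduction(9.05, 7.66)
print(f"{reduction:.1%}")  # 15.4%
```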

Qualitative analysis includes t‑SNE visualizations of the aligned speech embeddings, showing clear language‑wise clusters with minimal overlap, indicating that MOSA can treat each language almost as a dedicated monolingual system while still sharing useful features. Router weight distributions further demonstrate that different languages preferentially activate distinct adapters, validating the intended specialization.

The paper’s contributions are threefold: (1) introducing a simple yet effective MoE projector that replaces complex transformer adapters; (2) empirically showing that a mixture of lightweight adapters mitigates multilingual parameter interference and enables positive transfer from high‑resource to low‑resource languages; (3) providing extensive ablations and visual analyses that substantiate the design choices.

Implications for the field include a pathway to more scalable, cost‑effective multilingual ASR systems that can be deployed with limited computational resources. Future work may explore richer router mechanisms (e.g., token‑level routing), language‑conditioned prompts within adapters, or integration with self‑supervised pre‑training to further boost low‑resource performance.
