Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition
Recent advances in LLM-based ASR connect frozen speech encoders with Large Language Models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). Our SMEAR-MoE achieves strong performance, delivering up to a 7.6% relative WER reduction over the single-projector baseline, while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
💡 Research Summary
The paper addresses a critical bottleneck in large‑language‑model (LLM) based automatic speech recognition (ASR) pipelines: the lightweight projector that bridges a frozen speech encoder and a frozen LLM. While a single projector works well for monolingual tasks, it struggles in multilingual settings because it must learn to map heterogeneous acoustic patterns to a common textual embedding space. The authors systematically explore three families of projector designs—monolithic, static multi‑projector, and dynamic mixture‑of‑experts (MoE)—and demonstrate that conventional MoE suffers from instability and “expert collapse” when data are limited, as only the selected experts receive gradient updates.
To overcome this, they propose SMEAR‑MoE (Soft Merging of Experts with Adaptive Routing). In SMEAR‑MoE, a shared convolutional downsampler first reduces the encoder output, after which a set of lightweight MLP experts processes the downsampled features. A gating network produces token‑level probabilities, which are averaged across the sequence to obtain a global gate vector ḡ. Instead of routing each token to a hard‑selected subset of experts (as in traditional top‑k MoE), SMEAR‑MoE computes a virtual expert by taking a weighted average of all expert parameters using ḡ as weights. This virtual expert is then applied to the entire input. Consequently, every expert receives a dense gradient proportional to its gate weight, eliminating expert collapse and stabilizing training even in low‑resource multilingual scenarios.
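The soft-merging step described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the class name, layer sizes, and the two-layer MLP expert structure are assumptions; the paper only specifies that token-level gate probabilities are averaged into ḡ and that expert *parameters* (rather than expert outputs) are mixed with those weights before a single forward pass.

```python
import torch
import torch.nn as nn


class SMEARProjector(nn.Module):
    """Hypothetical sketch of SMEAR-style soft expert merging.

    Expert parameters are averaged with the sequence-level gate vector
    g_bar, so every expert receives a dense gradient proportional to
    its gate weight (no hard top-k routing).
    """

    def __init__(self, dim: int, hidden: int, num_experts: int):
        super().__init__()
        # Each expert is a 2-layer MLP; weights are stacked along dim 0
        self.w1 = nn.Parameter(torch.randn(num_experts, dim, hidden) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(num_experts, hidden))
        self.w2 = nn.Parameter(torch.randn(num_experts, hidden, dim) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(num_experts, dim))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -- downsampled encoder features
        probs = self.gate(x).softmax(dim=-1)   # token-level gate probabilities
        g_bar = probs.mean(dim=1)              # global gate vector, (batch, E)
        # Soft-merge all expert parameters into one "virtual expert"
        w1 = torch.einsum("be,edh->bdh", g_bar, self.w1)
        b1 = torch.einsum("be,eh->bh", g_bar, self.b1)
        w2 = torch.einsum("be,ehd->bhd", g_bar, self.w2)
        b2 = torch.einsum("be,ed->bd", g_bar, self.b2)
        # Apply the merged virtual expert to the entire input
        h = torch.relu(torch.einsum("bsd,bdh->bsh", x, w1) + b1.unsqueeze(1))
        return torch.einsum("bsh,bhd->bsd", h, w2) + b2.unsqueeze(1)
```

Because the merged weights are a differentiable function of every expert's parameters, a single backward pass updates all experts at once, which is exactly the property that prevents expert collapse in the low-resource regime the paper targets.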
The experimental setup uses a frozen Whisper large‑v3 multilingual encoder and a frozen Gemma‑2‑9B LLM, training only the projector. Four mid‑resource Indic languages—Hindi, Marathi, Tamil, and Telugu—are each provided with roughly 250 hours of training data from IndicVox and IndicSUPERB. Evaluation spans four diverse test sets (Fleurs, IndicTTS, Kathbath, MUCS) covering read, conversational, and crowdsourced speech. The monolithic baseline contains ~18 M parameters; a dense ensemble of language‑specific projectors scales to ~72 M; SMEAR‑MoE uses a shared downsampler (~13 M) plus four MLP experts (~9 M each) for a total of ~53 M parameters, comparable to the baseline in computational cost.
Results show that SMEAR‑MoE achieves the lowest average word error rate (WER) across all benchmarks: 8.2 % versus 11.5 % for the single‑projector baseline (a 28 % relative reduction) and 9.3 % for the dense ensemble. Language‑specific gains are consistent, with the most pronounced improvements on low‑resource subsets. Routing analysis reveals linguistically meaningful specialization: Hindi and Marathi heavily share one expert (reflecting their Indo‑Aryan family and shared Devanagari script), Tamil consistently selects a different expert, while Telugu exhibits a more distributed routing pattern, mirroring its Dravidian lineage and distinct script. This emergent behavior demonstrates that SMEAR‑MoE not only boosts performance but also learns interpretable language relationships without explicit supervision.
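As a sanity check on the headline numbers, the relative reduction quoted above follows directly from the two average WERs; the helper below is illustrative, not from the paper.

```python
def relative_wer_reduction(baseline: float, system: float) -> float:
    """Relative WER reduction (%) of `system` over `baseline` WER."""
    return 100.0 * (baseline - system) / baseline


# Average WERs reported in the summary: baseline 11.5%, SMEAR-MoE 8.2%
print(round(relative_wer_reduction(11.5, 8.2), 1))  # -> 28.7 (~28% relative)
```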
In terms of efficiency, SMEAR‑MoE’s real‑time factor (RTF) on an NVIDIA H200 GPU is 0.198, virtually identical to the single‑projector’s 0.196, and substantially lower than the dense ensemble’s 0.243. Thus, the method delivers strong accuracy gains without sacrificing latency or computational budget.
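For readers unfamiliar with the metric, the real-time factor is simply processing time divided by audio duration (RTF < 1 means faster than real time). The durations below are made-up illustrative values chosen to reproduce the reported RTF, not measurements from the paper.

```python
def real_time_factor(proc_time_s: float, audio_dur_s: float) -> float:
    """RTF = wall-clock processing time / audio duration; < 1 is faster
    than real time."""
    return proc_time_s / audio_dur_s


# e.g., decoding 100 s of audio in 19.8 s matches SMEAR-MoE's reported RTF
print(round(real_time_factor(19.8, 100.0), 3))  # -> 0.198
```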
The authors conclude that stabilized multi‑expert projectors are a promising direction for scalable, efficient, and robust multilingual ASR with LLM back‑ends. Future work could explore scaling the number of experts, extending to a larger set of languages and dialects, and integrating external linguistic cues (e.g., language IDs, script information) into the gating mechanism to further enhance cross‑lingual sharing and specialization.