SLM-S2ST: A multimodal language model for direct speech-to-speech translation
Speech-aware language models (LMs) have demonstrated strong capabilities in understanding spoken language while generating text-based responses. However, enabling them to produce speech output efficiently and effectively remains a challenge. In this paper, we present SLM-S2ST, a multimodal LM for direct speech-to-speech translation (S2ST), built on the open-source Phi4-MM model. SLM-S2ST extends its predecessor by generating translated speech through an audio transformer head that predicts audio tokens with a delay relative to text tokens, followed by a streaming vocoder for waveform synthesis. Experimental results on the CVSS-C dataset show that SLM-S2ST significantly surpasses existing baseline models trained on the same dataset. Furthermore, when we scale up the training data and model size, SLM-S2ST matches the performance of the current SOTA model.
💡 Research Summary
The paper introduces SLM‑S2ST, a multimodal language model that extends the open‑source Phi4‑MM speech‑aware LLM to perform direct speech‑to‑speech translation (S2ST). The authors keep the pre‑trained speech encoder, adapter, and the shared Transformer backbone of Phi4‑MM frozen, and add two lightweight post‑LM modules: a text post‑LM (identical to the original) and an audio post‑LM that predicts audio tokens. Both modules consume the same hidden states from the shared layer, allowing simultaneous generation of text and audio tokens in a single forward pass. Crucially, audio token generation is delayed by three tokens relative to the text, giving the model a “look‑ahead” window that leverages future text context to improve audio token quality.
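The delayed-generation scheme can be sketched as a simple stream-alignment step: the audio token for position t is emitted alongside the text token for position t + 3, so the audio head always sees three future text tokens. This is a minimal illustrative sketch; the token values, the padding placeholder, and the function name are invented here, not taken from the paper.

```python
# Minimal sketch of the delayed audio-token pattern: the audio stream
# lags the text stream by DELAY positions, giving the audio head a
# look-ahead window over future text. PAD value 0 is an assumption.
DELAY = 3

def build_delayed_streams(text_tokens, audio_tokens, pad=0):
    """Align the two token streams so the audio token for step t is
    predicted at step t + DELAY; both streams are padded to equal length."""
    total = max(len(text_tokens), DELAY + len(audio_tokens))
    text_stream = list(text_tokens) + [pad] * (total - len(text_tokens))
    audio_stream = [pad] * DELAY + list(audio_tokens)
    audio_stream += [pad] * (total - len(audio_stream))
    return text_stream, audio_stream

# Example: five text tokens and five audio tokens; the audio tokens
# start only after three text tokens have been produced.
text, audio = build_delayed_streams([11, 12, 13, 14, 15], [91, 92, 93, 94, 95])
```

With a delay of three, the first audio token is conditioned on the first three text tokens, which is the "look-ahead" the summary describes.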
Audio tokens are produced by a pre‑trained finite‑scalar‑quantization (FSQ) speech tokenizer (CosyVoice 2). The token stream is grouped into 10‑token chunks; each chunk, together with a speaker prompt, is fed to a causal flow‑matching model to synthesize a mel‑spectrogram, which is then converted to a waveform by a HiFi‑GAN vocoder. This streaming pipeline yields low‑latency, high‑fidelity speech output.
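The streaming pipeline above amounts to a chunked generator loop: as soon as ten tokens accumulate, they are handed (with the speaker prompt) to the token-to-waveform stage. The sketch below stubs out the flow-matching model and HiFi-GAN vocoder with a placeholder callable; only the chunking logic reflects the described design.

```python
# Hypothetical sketch of the streaming synthesis loop. The 10-token
# chunk size follows the summary; tokens_to_waveform stands in for the
# causal flow-matching model + HiFi-GAN vocoder, which are external.
CHUNK_SIZE = 10

def stream_chunks(audio_tokens, chunk_size=CHUNK_SIZE):
    """Yield successive fixed-size chunks of the audio-token stream
    (the final chunk may be shorter)."""
    for start in range(0, len(audio_tokens), chunk_size):
        yield audio_tokens[start:start + chunk_size]

def synthesize_streaming(audio_tokens, speaker_prompt, tokens_to_waveform):
    """Convert each chunk to audio as soon as it is complete, so playback
    can begin before the full token sequence has been generated."""
    waveform = []
    for chunk in stream_chunks(audio_tokens):
        waveform.extend(tokens_to_waveform(chunk, speaker_prompt))
    return waveform
```

Because each chunk is synthesized independently of later tokens, latency is bounded by one chunk of generation plus one vocoder call rather than by the full utterance length.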
Training is performed in a single fine‑tuning stage. The audio post‑LM consists of six Transformer decoder layers (randomly initialized) and a LoRA module (rank 320) applied only to the original LLM decoder layers (shared layer and text post‑LM). Only the audio post‑LM, LoRA, the audio head, and a linear speech‑out layer are updated; all other parameters remain frozen. Two model sizes are explored: a 4 B‑parameter base and a scaled‑up 7 B version.
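The selective fine-tuning recipe reduces to partitioning parameters by module: only the audio post-LM, the LoRA adapters, the audio head, and the speech-out projection receive gradients. A small sketch of that partitioning, with module names invented for illustration (the paper's actual parameter names are not given):

```python
# Illustrative parameter-freezing sketch. The name prefixes below are
# assumptions standing in for the four trainable module groups named in
# the summary; everything else (encoder, adapter, backbone) stays frozen.
TRAINABLE_PREFIXES = ("audio_post_lm.", "lora.", "audio_head.", "speech_out.")

def split_parameters(named_params):
    """Partition (name, param) pairs into trainable and frozen name lists
    based on which module each parameter belongs to."""
    trainable, frozen = [], []
    for name, _param in named_params:
        (trainable if name.startswith(TRAINABLE_PREFIXES) else frozen).append(name)
    return trainable, frozen
```

In a PyTorch setup this partition would typically drive `param.requires_grad` flags and the optimizer's parameter groups, so the frozen backbone contributes activations but no gradient updates.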
Data consists of three parts. The primary benchmark is CVSS‑C (≈ 940 h of source‑target speech pairs synthesized with a single‑speaker TTS). An additional multilingual dataset, CVSS‑M (≈ 940 h), uses zero‑shot TTS to generate multi‑speaker target speech. Finally, the authors augment the training set with roughly 10 000 h of speech‑translation data derived from the Phi4‑MM pre‑training corpus, converting target‑language text to speech with the same TTS pipeline. This yields a total of over 11 000 h when the large‑scale data are included.
Evaluation uses three test suites: CoVoST2, FLEURS, and CVSS. For speech‑to‑text translation (S2TT) the standard BLEU score is reported. For speech‑to‑speech translation (S2ST) the authors adopt the ASR‑BLEU metric: generated speech is transcribed by Whisper‑Large‑V3, normalized, and compared to reference texts. They also compute WER between the model’s generated text and the transcription of its own speech to assess alignment.
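The text-speech alignment check reduces to a standard WER computation between the generated text and the ASR transcript of the generated speech. Below is a self-contained WER implementation via word-level edit distance; the lowercase/split normalization is a simple stand-in for the paper's actual text normalizer, and the ASR step (Whisper-Large-V3) is assumed to have already produced the hypothesis string.

```python
# Word error rate via Levenshtein distance over word sequences, as used
# for the text-vs-own-speech alignment check. Normalization here
# (lowercasing + whitespace split) is a simplification.
def wer(reference, hypothesis):
    """Return (substitutions + insertions + deletions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

ASR-BLEU follows the same pattern but scores the transcript against reference translations with corpus BLEU (e.g. via sacrebleu) instead of edit distance.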
Results show that the 4 B model trained only on CVSS‑C outperforms all previously reported S2ST baselines (S2UT, Translatotron 2, DASpeech, UnitY, StreamSpeech) by a large margin, achieving ASR‑BLEU scores of 35.9–39.7 across seven language pairs (DE‑EN, ES‑EN, FR‑EN, IT‑EN, JA‑EN, PT‑EN, ZH‑EN). When scaled to 7 B parameters and the full 11 k‑hour dataset, SLM‑S2ST reaches performance comparable to the current state‑of‑the‑art SeamlessM4T v2 Large, which was trained on millions of hours of multilingual data. The model also maintains low WER between its text and speech outputs, indicating strong text‑speech alignment.
Key insights include: (1) a speech‑aware LLM can be turned into a full S2ST system with minimal additional components, demonstrating the power of large multimodal pre‑training; (2) delaying audio token generation provides useful future context and markedly improves audio quality; (3) a streaming token‑to‑mel‑spectrogram pipeline enables low‑latency, real‑time translation; (4) scaling data and model size yields SOTA‑level results without the massive compute traditionally required for end‑to‑end speech translation.
Limitations are noted: the frozen speech encoder and tokenizer may restrict adaptation to new domains, accents, or expressive prosody; the approach still relies on external tokenizers and vocoders, which could become bottlenecks for fully integrated training; and the evaluation focuses on a limited set of languages and benchmark datasets. Future work could explore joint fine‑tuning of the encoder and tokenizer, broader language coverage, and more efficient small‑model variants.
In summary, SLM‑S2ST offers a practical, reproducible pathway to end‑to‑end speech‑to‑speech translation by building on an open‑source multimodal LLM, introducing a delayed audio token head, and employing a streaming vocoder. The method achieves state‑of‑the‑art performance with modest resources and opens the door for the wider research community to develop advanced speech translation systems.