Simultaneous Speech-to-Speech Translation Without Aligned Data

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments built with language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with fewer than 1,000 hours of speech. We provide examples, model weights, and inference code, and we release a benchmark containing 45 hours of multilingual data for speech translation evaluation.


💡 Research Summary

The paper introduces Hibiki‑Zero, a novel simultaneous speech‑to‑speech translation system that completely eliminates the need for word‑level alignment data, a long‑standing bottleneck in real‑time translation research. Traditional simultaneous translation models rely on either costly human‑annotated interpretation data or synthetic alignments generated by language‑specific heuristics, both of which struggle with non‑monotonic word dependencies and are difficult to scale across languages. Hibiki‑Zero replaces this requirement with a two‑stage training pipeline that uses only sentence‑level aligned data, which can be obtained automatically from punctuation and transcript boundaries, and then refines the model with a reinforcement‑learning (RL) phase that directly optimizes latency while preserving translation quality.

Base Model Construction
The system first encodes both source and target speech waveforms using the pre‑trained Mimi codec, which converts audio into a hierarchy of discrete tokens: the first quantization level captures semantic (meaning) information, while subsequent levels encode increasingly fine‑grained acoustic details. These token streams are modeled jointly by an RQ‑Transformer, a variant of the Transformer that processes tokens along two axes—time and quantization level—allowing simultaneous handling of multiple streams (acoustic, semantic, and text). A special “Inner Monologue” text stream is inserted between the semantic and acoustic levels; during training it contains the ground‑truth transcription of the target speech, and at inference time it provides an intermediate text translation that the model generates on‑the‑fly.
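The per-frame token arrangement described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: the number of quantization levels `N_Q`, the `PAD` value, and the exact stream ordering are all assumptions.

```python
# Sketch of one time step in the multi-stream layout: the semantic token
# (Mimi level 1), then the "Inner Monologue" text token, then the
# remaining acoustic tokens. N_Q and PAD are assumed values.

N_Q = 8          # total Mimi quantization levels (assumption)
PAD = -1         # filler for frames where no text token is emitted (assumption)

def build_frame_stack(semantic_tok, text_tok, acoustic_toks):
    """Stack the streams for one frame, in the order the RQ-Transformer
    would model them along its quantization-level axis."""
    assert len(acoustic_toks) == N_Q - 1
    text = text_tok if text_tok is not None else PAD
    return [semantic_tok, text, *acoustic_toks]

# One frame: semantic token 42, text token 7, seven acoustic tokens.
frame = build_frame_stack(42, 7, list(range(7)))
```

The RQ-Transformer then models this stack along two axes: a temporal model over frames and a depth model over the positions within each stack.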

Coarse Alignment without Word‑Level Data
To create training pairs, the authors start from speech-translation corpora that are aligned only at the sentence level, with no word-level correspondence. They insert artificial silences into the target audio to shift each sentence by a random offset proportional to its duration, thereby delaying the target relative to the source while preserving the sentence-level correspondence. For naturalness, a separate TTS model conditioned on the source speaker synthesizes the target audio with these pauses, yielding smoother speech that still respects the coarse alignment.
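The delay-scheduling idea can be sketched in a few lines. This is a toy illustration: the maximum silence-to-duration ratio `max_ratio` and the uniform sampling are assumptions, not values from the paper.

```python
import random

# Toy sketch of coarse alignment: delay each target sentence by a random
# offset proportional to its own duration by inserting silence before it.

def insert_silences(sentence_durations, max_ratio=0.5, rng=None):
    """Return (silence, speech) duration pairs: each sentence is preceded
    by a silence of up to max_ratio * its duration (assumed range)."""
    rng = rng or random.Random(0)
    schedule = []
    for dur in sentence_durations:
        silence = rng.uniform(0.0, max_ratio) * dur
        schedule.append((silence, dur))
    return schedule

# Three target sentences of 2.0s, 3.5s, and 1.2s.
schedule = insert_silences([2.0, 3.5, 1.2])
```

In the actual pipeline the silences would be realized as audio (or resynthesized by the speaker-conditioned TTS model) rather than kept as a schedule.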

Reinforcement Learning with Process Rewards
After the base model is trained, Hibiki‑Zero undergoes RL to reduce latency. The authors adapt Group Relative Policy Optimization (GRPO) and design a single reward function based solely on BLEU scores. For each generated partial translation at frame t, the reward is a weighted sum of the BLEU between the partial hypothesis and the reference up to the current sentence, and the BLEU of the final output. This “process reward” captures both incremental quality and final fidelity. Rewards are normalized across a group of G sampled translations, summed over future frames to form an advantage, and then used in a PPO‑style clipped objective without KL regularization, dramatically lowering memory consumption and avoiding instability seen in prior work.
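The reward and advantage computation above can be sketched as follows. This is a hedged reconstruction: `bleu` is a stand-in for a real BLEU scorer (e.g. sacreBLEU), and the mixing weight `alpha` is an assumption, as is the exact normalization.

```python
# Sketch of the BLEU-based process reward and group-normalized,
# future-summed advantages described above (illustrative, not the
# paper's exact formulation).

def process_reward(partial_hyp, ref_prefix, final_hyp, final_ref, bleu, alpha=0.5):
    """Weighted sum of partial-vs-reference-prefix BLEU and final BLEU."""
    return alpha * bleu(partial_hyp, ref_prefix) + (1 - alpha) * bleu(final_hyp, final_ref)

def grpo_advantages(rewards):
    """rewards[g][t]: reward of rollout g at frame t, over G rollouts.
    Normalize each frame's rewards across the group (mean/std, as in
    GRPO), then sum over future frames to form the advantage."""
    G, T = len(rewards), len(rewards[0])
    adv = [[0.0] * T for _ in range(G)]
    for t in range(T):
        col = [rewards[g][t] for g in range(G)]
        mean = sum(col) / G
        std = (sum((r - mean) ** 2 for r in col) / G) ** 0.5 or 1.0
        for g in range(G):
            adv[g][t] = (rewards[g][t] - mean) / std
    # Accumulate normalized rewards over future frames.
    for g in range(G):
        for t in reversed(range(T - 1)):
            adv[g][t] += adv[g][t + 1]
    return adv
```

These advantages would then plug into a standard PPO-style clipped surrogate loss; dropping the KL term means no frozen reference model needs to be kept in memory.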

Experimental Results
The system is evaluated on five X‑to‑English language pairs (including Japanese, Spanish, French, etc.). Across all metrics—BLEU, average latency, speaker identity preservation, and mean opinion score (MOS) for naturalness—Hibiki‑Zero outperforms previous state‑of‑the‑art models. Latency reductions exceed 30% while BLEU gains range from +2 to +3 points. Speaker identity scores approach 0.9, indicating near‑perfect voice transfer, and MOS improvements of 0.2–0.3 reflect noticeably more natural speech. Moreover, the model can be adapted to a new source language with fewer than 1,000 hours of speech data, demonstrating strong data efficiency and rapid multilingual expansion.

Resources and Impact
The authors release a 45‑hour multilingual benchmark, model checkpoints, and inference code, facilitating reproducibility and future research. By removing the dependency on word‑level alignments and leveraging a simple yet effective BLEU‑based RL signal, Hibiki‑Zero simplifies the engineering pipeline, reduces the need for language‑specific heuristics, and opens the door to scalable, high‑quality simultaneous speech translation for a wide range of languages. This work represents a significant step toward practical, real‑time multilingual communication systems.

