RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation
Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise terminology hints to the Speech LLM. We further synthesize training data that teaches the Speech LLM to leverage retrieved terms precisely. Experiments on three language directions of the ACL 60/60 dev set show that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablations confirming the contribution of each component.
💡 Research Summary
Simultaneous speech translation (SST) aims to produce target-language text incrementally as a source speech stream arrives. Recent advances in speech‑large language models (Speech LLMs) have dramatically improved overall translation quality, yet they still falter on rare, domain‑specific terminology such as technical jargon, proper names, and abbreviations. Human interpreters routinely consult glossaries in real time, and retrieval‑augmented machine translation has proven effective for terminology handling. However, bringing retrieval into SST introduces three unique challenges: (1) cross‑modal (speech‑to‑text) retrieval must remain accurate under partial, continually arriving input; (2) retrieval latency must stay low enough for real‑time use; and (3) the generation model must decide not only whether but also when to inject retrieved terms during incremental decoding.
The paper proposes Retrieval‑Augmented Simultaneous Speech Translation (RASST), a tightly integrated system that addresses all three challenges. RASST consists of three main components: a lightweight dual‑encoder retriever, a data‑synthesis pipeline that creates realistic training examples for both the retriever and the Speech LLM, and a fine‑tuning strategy that teaches the LLM to use retrieved terminology cues appropriately.
Cross‑modal retriever.
The retriever aligns short speech windows with glossary entries. Textual terms are encoded with the BGE‑M3 dense encoder, producing d‑dimensional vectors that are ℓ2‑normalized. Speech windows are processed by the Qwen3‑Omni Audio Transformer; a learned attention‑based pooling layer aggregates frame‑level representations, followed by a linear projection and ℓ2‑normalization to obtain a matching d‑dimensional speech embedding. Retrieval is performed with a sliding‑window scheme: a fixed‑length window (W ≈ 1.92 s) slides over the incoming stream with stride δ (δ < chunk length l). For each window, the top‑K₁ nearest glossary terms are retrieved via cosine similarity using a FAISS index. When a new speech chunk arrives, the system aggregates candidates from all windows that intersect the chunk and keeps the top‑K₂ terms overall. This design avoids re‑encoding the entire history at each step, dramatically reducing computational cost while preserving high recall.
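The chunkwise candidate aggregation described above can be sketched in a few lines of pure Python. This is a hypothetical illustration, not the paper's implementation: `retrieve_for_chunk`, `dot`, and `topk` are placeholder names, and brute-force cosine search over a toy glossary stands in for the FAISS index.

```python
def dot(a, b):
    # embeddings are assumed ℓ2-normalized, so a dot product equals cosine similarity
    return sum(x * y for x, y in zip(a, b))

def topk(scored, k):
    # highest-scoring (item, score) pairs first
    return sorted(scored, key=lambda x: -x[1])[:k]

def retrieve_for_chunk(window_embs, glossary, k1=5, k2=10):
    """Aggregate per-window candidates for one incoming speech chunk.

    window_embs: embeddings of the sliding windows that intersect the chunk.
    glossary: list of (term, embedding) pairs; a real system would query a
    FAISS index here instead of scoring every entry.
    """
    candidates = {}
    for w in window_embs:
        scored = [(term, dot(w, v)) for term, v in glossary]
        for term, s in topk(scored, k1):        # keep top-K1 per window
            candidates[term] = max(candidates.get(term, -1.0), s)
    # merge candidates across windows and keep the top-K2 terms overall
    return [t for t, _ in topk(list(candidates.items()), k2)]
```

Because each new chunk only triggers searches for the few windows that overlap it, the history never needs to be re-encoded.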
Retriever training data.
To train the dual‑encoder, the authors construct (speech window, term) pairs from the GigaSpeech ASR corpus. They first run the Montreal Forced Aligner to obtain word‑level timestamps, then extract noun phrases with spaCy (en_core_web_trf) as candidate terms, reasoning that most domain‑specific terminology appears as noun phrases. Each 1.92 s window is paired with every noun phrase whose timestamps lie completely inside the window, yielding roughly 4 million aligned pairs. A multi‑positive InfoNCE loss is employed because a single window may correspond to multiple terms. The loss encourages high similarity between a speech embedding and all its positive term embeddings while contrasting against in‑batch negatives. Both encoders are fine‑tuned with LoRA adapters for parameter efficiency.
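The multi‑positive InfoNCE objective can be written out for a single speech window as follows. This is a minimal sketch under the stated assumptions (ℓ2‑normalized embeddings, dot product as similarity); the function name and the temperature value are illustrative, not taken from the paper.

```python
import math

def multi_positive_infonce(speech_emb, term_embs, positive_idx, tau=0.05):
    """Multi-positive InfoNCE loss for one speech window.

    term_embs holds all in-batch term embeddings; positive_idx marks which
    of them are positives for this window. Non-positives act as in-batch
    negatives. Vectors are assumed ℓ2-normalized.
    """
    sims = [sum(a * b for a, b in zip(speech_emb, t)) / tau for t in term_embs]
    log_z = math.log(sum(math.exp(s) for s in sims))   # log partition over all terms
    # average the cross-entropy term over all positives of this window
    return -sum(sims[i] - log_z for i in positive_idx) / len(positive_idx)
```

Averaging over the positive set is what lets one window pull several co-occurring terms toward its embedding at once, rather than treating each pair independently.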
Teaching the Speech LLM to use retrieved terms.
The translation model is a large Speech LLM (e.g., a decoder‑only model pre‑trained on massive speech‑text data). Training data are synthesized with the InfiniSST pipeline: source speech is chunked, and target translations are generated incrementally to mimic the lag inherent in SST. Crucially, each chunk is paired with a set of retrieved terms Ĝᵢ. To make the model robust to retrieval errors, three retrieval patterns are mixed during training: (1) Standard: ground‑truth terms plus hard negatives sampled from the retriever’s top‑K₁ results (budget ≤ 20 terms); (2) None: an empty term set, forcing the model to translate purely from speech; (3) All‑Wrong: only incorrect terms, simulating a completely failed retriever. The model is fine‑tuned with LoRA on this synthetic data, with a standard cross‑entropy loss applied only to translation tokens. Because the target tokens naturally lag behind the speech input, the model acquires a timing signal: it observes when a term occurs in the speech context and when its translation later appears in the target, and thus learns when to emit a retrieved term and when to wait.
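The three retrieval patterns can be sketched as a small data-synthesis helper. This is an assumed reconstruction for illustration: the function name, the exact hard-negative selection, and the default budget are placeholders, not details confirmed by the paper.

```python
def make_term_set(gt_terms, retriever_candidates, pattern, budget=20):
    """Build the term set attached to one training chunk.

    gt_terms: ground-truth glossary terms for the chunk.
    retriever_candidates: the retriever's top-K1 list, which may contain errors.
    pattern: "standard", "none", or "all_wrong", mirroring the three
    retrieval patterns mixed during training.
    """
    wrong = [t for t in retriever_candidates if t not in gt_terms]
    if pattern == "none":
        return []                      # translate from speech alone
    if pattern == "all_wrong":
        return wrong[:budget]          # simulate a completely failed retriever
    # "standard": ground truth plus hard negatives, capped at the budget
    hints = list(gt_terms) + wrong[: max(0, budget - len(gt_terms))]
    return hints[:budget]
```

Mixing the three patterns during fine-tuning is what keeps the model from blindly copying whatever the retriever proposes at test time.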
Experimental setup.
Evaluation is performed on the ACL 60/60 development set, which contains five full ACL talks in English with reference translations into Chinese, German, and Japanese. The authors treat each talk as an unsegmented stream to simulate real‑world usage. Two glossaries are used: the official Tagged Glossary supplied with the dataset and a “Paper‑Extracted Glossary” built by extracting terms from the corresponding ACL papers. Metrics include SacreBLEU for overall translation quality, terminology translation accuracy (the percentage of reference terms appearing in the hypothesis), and streaming latency measured with Stream‑LAAL.
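The terminology-accuracy metric described above can be sketched directly from its definition. This is a simplified illustration: it uses case-insensitive substring matching, whereas the paper's exact matching rules (tokenization, morphology) may differ, and all names are placeholders.

```python
def terminology_accuracy(references, hypotheses, glossary_terms):
    """Percentage of glossary terms present in a reference that also
    appear in the corresponding system hypothesis."""
    hits = total = 0
    for ref, hyp in zip(references, hypotheses):
        for term in glossary_terms:
            if term.lower() in ref.lower():    # term is expected in this segment
                total += 1
                hits += term.lower() in hyp.lower()
    return 100.0 * hits / total if total else 0.0
```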
Results.
RASST achieves up to +3 BLEU points over a strong baseline Speech LLM without retrieval, and terminology accuracy improves by 10–16 percentage points across the three language directions. Computational overhead is modest: the sliding‑window retriever adds less than 16% extra runtime, keeping the system within real‑time constraints. Ablation studies show that removing the retriever collapses terminology accuracy, while training without hard negatives makes the model vulnerable to retrieval noise, reducing BLEU by up to 1.2 points.
Analysis and implications.
The paper demonstrates that non‑parametric, cross‑modal retrieval can be seamlessly incorporated into a streaming translation pipeline. By designing a lightweight dual‑encoder, employing efficient sliding‑window inference, and exposing the LLM to diverse retrieval scenarios during training, RASST balances speed, robustness, and accuracy. The approach opens avenues for further research: scaling to multilingual glossaries, integrating confidence‑weighted retrieval, and combining with interactive interpreter interfaces. Overall, RASST represents a significant step toward practical, terminology‑aware simultaneous speech translation.