Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Interlingual subtitling, which translates the subtitles of visual media into a target language, is essential for entertainment localization but remains largely unexplored in machine translation research. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges for interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which together tackle these challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.


💡 Research Summary

The paper tackles the under‑explored task of interlingual subtitling – translating the subtitles of visual media (films, TV series, etc.) into a target language – by leveraging large language models (LLMs) in a multimodal setting. While recent LLMs have dramatically improved general machine translation, subtitle translation presents four distinct challenges: (1) maintaining semantic coherence across short, tightly linked subtitle lines despite limited input token windows; (2) correctly handling pronoun references that differ across languages; (3) accurately translating specialized terminology that appears frequently in visual media; and (4) delivering translations that are not only accurate but also expressive, i.e., natural, vivid, and stylistically appropriate.

To address these issues, the authors propose Hermes, a unified framework composed of three tightly integrated modules:

  1. Speaker Diarization (SD) – This module fuses visual and audio cues to identify which character speaks each subtitle line. Video frames aligned with subtitle timestamps are processed by TalkNet to detect active speakers, and facial embeddings are extracted with CurricularFace. Spectral clustering groups these embeddings into visual speaker clusters. Simultaneously, audio segments are encoded with ERes2NetV2 to obtain timbre embeddings, which are clustered separately. Visual clusters serve as anchors; a voting mechanism assigns the most frequent audio cluster to each visual speaker, producing a representative timbre prototype. For lines where no face is visible (≈35 % of the data), the system computes cosine similarity between adjacent lines, defines group boundaries, and, if the average speaker confidence falls below a threshold, registers a new speaker based on the average timbre of the group. This approach handles unknown numbers of speakers, mitigates audio‑visual asynchrony, and supplies reliable speaker labels for pronoun translation.
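The anchoring, voting, and fallback-registration steps above can be sketched in a few lines of Python. The function names, the 2-D embeddings, and the 0.6 confidence threshold are illustrative assumptions, not the paper's implementation (which uses TalkNet, CurricularFace, and ERes2NetV2 embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_vec(vecs):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def assign_timbre_prototypes(visual_labels, audio_labels, audio_embs):
    """For each visual speaker cluster (the anchor), vote the most frequent
    co-occurring audio cluster and average its embeddings into a timbre prototype."""
    prototypes = {}
    for v in set(visual_labels):
        idx = [i for i, lbl in enumerate(visual_labels) if lbl == v]
        votes = [audio_labels[i] for i in idx]
        winner = max(set(votes), key=votes.count)
        members = [audio_embs[i] for i in idx if audio_labels[i] == winner]
        prototypes[v] = mean_vec(members)
    return prototypes

def label_faceless_line(emb, prototypes, threshold=0.6):
    """Match a face-less line's timbre embedding against known prototypes;
    register a new speaker when the best confidence stays below the threshold."""
    best, score = None, -2.0
    for spk, proto in prototypes.items():
        s = cosine(emb, proto)
        if s > score:
            best, score = spk, s
    if score < threshold:
        new_id = max(prototypes) + 1
        prototypes[new_id] = list(emb)  # register a new speaker
        return new_id, score
    return best, score
```

In practice the real embeddings are high-dimensional and the grouping of adjacent face-less lines adds another averaging step, but the control flow follows the same anchor-then-vote pattern.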

  2. Terminology Identification (TI) – Subtitles contain many proper nouns and domain‑specific terms. The authors first use an off‑the‑shelf LLM (Qwen‑Max) in a one‑shot prompting setting to detect terms, their types, and candidate translations from bilingual line pairs in a collected parallel subtitle corpus (named D). Raw outputs are filtered and voted on to produce a clean term‑type‑translation set (T_filter). A prefix‑tree (Trie) indexes these terms for fast lookup in source lines, yielding a terminology dataset (D̂). This dataset is then used to fine‑tune a dedicated LLM (Qwen2.5‑14B, denoted π_term). During inference, even when only monolingual subtitles are available, π_term can retrieve and translate identified terms consistently, achieving 96.9 % recall and >92 % translation accuracy on a held‑out test set.
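A minimal prefix-tree lookup of the kind described might look like this; the class name, sentinel key, and longest-match-per-position policy are assumptions for illustration, not details from the paper:

```python
class TermTrie:
    """Prefix tree mapping source terms to (type, translation) entries,
    used to scan subtitle lines for known terminology."""

    def __init__(self):
        self.root = {}

    def insert(self, term, entry):
        node = self.root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = entry  # sentinel marks the end of a complete term

    def find_terms(self, line):
        """Scan the line; at each start position keep the longest matching term."""
        hits = []
        for start in range(len(line)):
            node, match = self.root, None
            for i in range(start, len(line)):
                ch = line[i]
                if ch not in node:
                    break
                node = node[ch]
                if "$" in node:
                    match = (line[start:i + 1], node["$"])
            if match:
                hits.append(match)
        return hits
```

Character-level tries like this work for both space-delimited and unsegmented scripts (e.g. Chinese), which is presumably why a trie is preferred over simple token matching here.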

  3. Expressiveness Enhancement (EE) – Recognizing that subtitle translation quality is multi‑dimensional, the authors introduce Segment‑wise Adaptive Preference Optimization (SAPO). For each segment, multiple translation candidates are generated by the main translation LLM (π_st). A separate LLM‑as‑Judge evaluates each candidate on three axes: accuracy (faithfulness to source), naturalness (grammatical fluency), and vividness (stylistic richness). The highest‑scoring candidate is selected and appended to the prompt for the next segment, ensuring contextual continuity while progressively optimizing for expressiveness. This multi‑objective optimization departs from traditional BLEU‑centric training and aligns the model’s output with human translator preferences.
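The segment-wise candidate-selection loop can be sketched as follows. Here `generate`, `judge`, the candidate count, and the score weights are hypothetical stand-ins for π_st and the LLM-as-Judge; the actual SAPO training objective is more involved than this greedy selection:

```python
def sapo_select(segments, generate, judge, n_candidates=4, weights=(0.4, 0.3, 0.3)):
    """Per segment: sample several candidate translations, score each on
    accuracy / naturalness / vividness, keep the best, and carry it forward
    as context for the next segment."""
    context, output = [], []
    for seg in segments:
        candidates = [generate(seg, context) for _ in range(n_candidates)]

        def total(cand):
            acc, nat, viv = judge(seg, cand)  # three judge scores
            return weights[0] * acc + weights[1] * nat + weights[2] * viv

        best = max(candidates, key=total)
        output.append(best)
        context.append(best)  # contextual continuity across segments
    return output
```

Appending the winning candidate to the running context is what keeps adjacent subtitle lines coherent while the per-segment scoring pushes the output toward more expressive phrasing.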

Dataset and Evaluation – The authors release a multilingual subtitle corpus harvested from the Chinese video platform Youku, covering several language pairs and genres (fantasy, sci‑fi, period drama, etc.). Evaluation of the SD module reports a Diarization Error Rate (DER) of 7.3 %, surpassing prior multimodal diarization baselines. Terminology translation is measured by term‑level precision/recall and shows substantial gains over generic MT systems. For overall translation, standard metrics (BLEU, METEOR, COMET) improve modestly, but the LLM‑as‑Judge expressiveness scores increase by an average of 3.2 points, indicating more natural and vivid subtitles. Human evaluation corroborates these findings, with annotators preferring Hermes outputs in 78 % of cases.
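For reference, DER is conventionally computed as missed speech plus false alarms plus speaker confusion, divided by total reference speech time. A minimal sketch (the component durations in the test are illustrative, not a breakdown reported by the paper):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER as a percentage: (missed speech + false alarm +
    speaker confusion) over total reference speech duration, all in seconds."""
    return 100.0 * (missed + false_alarm + confusion) / total_speech
```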

Contributions – The paper (1) formally defines interlingual subtitling as a multimodal MT problem and enumerates its practical challenges; (2) proposes a comprehensive LLM‑centric framework that integrates speaker information, domain terminology, and expressive optimization; (3) introduces a novel SAPO method and a multi‑dimensional evaluation protocol; and (4) provides a publicly available benchmark for future research.

Limitations and Future Work – The SD performance depends on video resolution and audio quality; low‑quality streams may degrade speaker detection. SAPO’s candidate generation is computationally intensive, posing challenges for real‑time streaming applications. The current experiments focus on a limited set of language pairs; extending to more diverse languages and cultural contexts is an open direction. Moreover, the authors suggest exploring lightweight alternatives for the judge model and investigating end‑to‑end training that jointly optimizes diarization, terminology handling, and translation.

In summary, Hermes demonstrates that by tightly coupling multimodal speaker cues, LLM‑based terminology distillation, and expressive preference optimization, it is possible to produce subtitle translations that are semantically coherent, terminologically accurate, and stylistically engaging—advancing the state of the art in interlingual subtitling.

