Benchmarking Machine Translation on Chinese Social Media Texts


The prevalence of rapidly evolving slang, neologisms, and highly stylized expressions in informal user-generated text, particularly on Chinese social media, poses significant challenges for Machine Translation (MT) benchmarking. Specifically, we identify two primary obstacles: (1) data scarcity, as high-quality parallel data requires bilingual annotators familiar with platform-specific slang and stylistic cues in both languages; and (2) metric limitations, where traditional evaluators like COMET often fail to capture stylistic fidelity and nonstandard expressions. To bridge these gaps, we introduce CSM-MTBench, a benchmark covering five Chinese-foreign language directions and consisting of two expert-curated subsets: Fun Posts, featuring context-rich, slang- and neologism-heavy content, and Social Snippets, emphasizing concise, emotion- and style-driven expressions. Furthermore, we propose tailored evaluation approaches for each subset: measuring the translation success rate of slang and neologisms in Fun Posts, while assessing tone and style preservation in Social Snippets via a hybrid of embedding-based metrics and LLM-as-a-judge. Experiments on over 20 models reveal substantial variation in how current MT systems handle semantic fidelity and informal, social-media-specific stylistic cues. CSM-MTBench thus serves as a rigorous testbed for advancing MT systems capable of mastering real-world Chinese social media texts.


💡 Research Summary

The paper addresses a critical gap in machine translation (MT) research: the lack of benchmarks and evaluation metrics that reflect the informal, rapidly evolving language found on Chinese social media platforms. While large‑scale MT benchmarks such as WMT and FLORES focus on formal, edited texts (news, Wikipedia), they do not capture the slang, neologisms, and highly stylized expressions that dominate user‑generated content. To fill this void, the authors introduce CSM‑MTBench, a multilingual benchmark covering five target languages (Spanish, French, Japanese, Korean, Russian) for Chinese‑to‑foreign translation.

CSM‑MTBench consists of two complementary subsets:

  1. Fun Posts – longer, narrative‑style user posts (average 41 Chinese characters) that are rich in context, events, personal experiences, and, crucially, slang and neologisms. About 52 % of the 1,183 Fun Posts contain at least one non‑standard expression.

  2. Social Snippets – short, often single‑sentence comments or reactions (average 10 characters) where the primary communicative goal is to convey emotion, attitude, or a specific tone rather than factual content.

The benchmark is built from real data harvested from a Chinese platform (Xiaohongshu). After rigorous filtering, bilingual experts manually translated each source sentence into the five target languages, ensuring high‑quality reference translations.

Novel Evaluation Methods

Fun Posts – Slang Success Rate (SSR)
The authors propose a dedicated metric for slang/neologism preservation. Using GPT‑5, they automatically identify all slang/neologism tokens in each Chinese source and locate their corresponding translations in the gold reference. For each token, GPT‑5 generates a set of plausible target‑language candidates, which are then vetted by human annotators. The final candidate set C⁺ includes the gold translation plus GPT‑generated alternatives. Model outputs are compared against C⁺ using a fuzzy‑matching function (RapidFuzz). If any candidate matches above a predefined similarity threshold, the token is counted as correctly translated. SSR is the average of these binary scores across all slang tokens in the dataset. This metric directly measures whether a model can reproduce non‑standard lexical items, something BLEU, chrF, or COMET cannot capture.
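The matching step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the data layout (`candidate_sets`), the threshold value, and the use of the standard library's `difflib` as a stand-in for RapidFuzz's partial matching are all assumptions.

```python
from difflib import SequenceMatcher

def _fuzzy_match(candidate, text, threshold):
    """True if `candidate` appears in `text`, allowing near matches."""
    c, t = candidate.lower(), text.lower()
    if c in t:
        return True
    # Slide a window the size of the candidate over the output and keep
    # the best character-level similarity (a rough partial-ratio analogue).
    n = len(c)
    windows = (t[i:i + n] for i in range(max(1, len(t) - n + 1)))
    return any(SequenceMatcher(None, c, w).ratio() >= threshold for w in windows)

def slang_success_rate(outputs, candidate_sets, threshold=0.8):
    """outputs: one model translation per source sentence.
    candidate_sets[i]: one vetted candidate set C+ per slang token in source i,
    e.g. [["no cap", "for real"], ["slaps"]].
    """
    hits = total = 0
    for output, tokens in zip(outputs, candidate_sets):
        for candidates in tokens:
            total += 1
            # The token counts as translated if any candidate in C+
            # matches the model output above the similarity threshold.
            if any(_fuzzy_match(c, output, threshold) for c in candidates):
                hits += 1
    return hits / total if total else 0.0
```

With one output containing "no cap" but missing "slaps", the sketch yields an SSR of 0.5, matching the binary per-token averaging described above.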

Social Snippets – Embedding Similarity (ES) + LLM‑as‑Judge
Because short snippets lack identifiable slang, the authors evaluate style, emotion, and sentiment preservation. They employ three pre‑trained embedding models: a style encoder (Patel et al., 2025), an emotion encoder (Poulaei et al., 2025), and a sentiment encoder (Tabularisai et al., 2025). For each source‑translation pair, cosine similarity is computed for each embedding type; the ES score is the average of the three similarities. To complement this automatic measure, they adapt the GEMBA‑Stars prompting framework to let a large language model (again GPT‑5) act as a judge, explicitly asking whether the translation retains the source’s tone and stylistic nuances. This hybrid approach provides both quantitative similarity scores and a more human‑like qualitative assessment.
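The ES computation reduces to averaging cosine similarities across the three encoders. The sketch below is a hedged illustration: the encoders are stand-in callables, whereas the paper uses the cited pre-trained style, emotion, and sentiment embedding models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def es_score(source, translation, encoders):
    """encoders: iterable of callables mapping text -> embedding vector
    (here: one each for style, emotion, and sentiment).
    ES is the mean cosine similarity across the encoders.
    """
    sims = [cosine(enc(source), enc(translation)) for enc in encoders]
    return sum(sims) / len(sims)
```

In practice each encoder would embed the Chinese source and its translation into a shared space; the LLM-as-judge rating is then collected separately and reported alongside ES rather than folded into it.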

Experimental Findings

The authors evaluate 22 models, spanning closed‑source commercial APIs (GPT‑4o, GPT‑5, Claude‑Sonnet‑4), open‑source general‑purpose LLMs (DeepSeek‑V3, GPT‑OSS‑120B, Aya‑Expanse, Gemma‑IT series, Qwen‑3 series), and translation‑specialized systems (NLLB‑3.3B, Aya‑101, Hunyuan‑MT‑7B, GemmaX2‑9B).

Key observations:

  • Overall quality – Large commercial LLMs achieve the highest XCOMET and BLEU scores on Fun Posts, indicating strong semantic fidelity on longer texts.

  • Slang handling – Despite high overall scores, SSR for these models hovers around 55‑65 %, revealing that slang and neologisms are frequently mistranslated, omitted, or replaced with generic equivalents. Smaller open‑source LLMs show even lower SSR, often below 40 %.

  • Style & emotion – On Social Snippets, ES scores are modest across the board (0.45‑0.62). LLM‑as‑judge ratings echo this trend, with many models receiving “tone not preserved” judgments, especially for highly emotive or sarcastic comments.

  • Prompt engineering – Simple style‑preserving prompts (e.g., “Translate while keeping the original humor and emotion”) improve ES by 3‑5 % for some models, but the gains are limited, suggesting that architectural or data‑centric solutions are needed.

  • Translation‑specialized models – Systems like Aya‑101 and Hunyuan‑MT, which are fine‑tuned on parallel data, perform more consistently across both metrics, though they still lag behind the best LLMs in overall semantic quality.
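The prompt-engineering observation above can be made concrete with a small template. The wording here is illustrative only, loosely following the quoted example; the paper's exact prompts are not reproduced.

```python
# Hypothetical style-preserving prompt template; `lang` and `text` are
# placeholders, not names from the paper.
STYLE_PROMPT = (
    "Translate the following Chinese social-media post into {lang}. "
    "Keep the original humor, emotion, and informal tone; render slang "
    "with an equivalent informal expression rather than a literal gloss.\n\n"
    "Post: {text}"
)

def build_prompt(text, lang):
    """Fill the template for one source post."""
    return STYLE_PROMPT.format(lang=lang, text=text)
```

Per the reported results, such prompts lift ES by only 3-5% for some models, which is why the authors argue for data- or architecture-level fixes instead.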

Contributions & Impact

  1. Dataset – CSM‑MTBench provides over 10 k high‑quality, human‑translated Chinese‑to‑foreign pairs covering diverse informal language phenomena.

  2. Metrics – SSR and the combined ES + LLM‑as‑judge framework constitute the first systematic attempts to evaluate non‑standard lexical preservation and stylistic fidelity in MT.

  3. Empirical insights – The extensive model comparison demonstrates that current MT technology, even state‑of‑the‑art LLMs, is not yet robust to the linguistic variability of real‑world Chinese social media.

  4. Open resources – All code, annotation guidelines, slang dictionaries, and evaluation scripts are released on GitHub, enabling reproducibility and future research.

In summary, the paper delivers a much‑needed benchmark that aligns MT evaluation with the realities of user‑generated content on Chinese social platforms. By exposing the shortcomings of existing systems and offering concrete evaluation tools, CSM‑MTBench paves the way for future models that can faithfully translate slang, neologisms, and nuanced emotional tones—an essential step toward truly universal, context‑aware machine translation.

