SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech
Figures of Speech (FoS) are multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well on the figurative expressions of high-resource languages, it often struggles with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we develop a binary classifier to differentiate between two types of FoS in the dataset, achieving an accuracy of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in their current capabilities, as these models often fail to convey idiomatic meanings accurately. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
💡 Research Summary
The paper introduces SinFoS, a newly constructed parallel corpus specifically designed for translating Sinhala figures of speech (FoS) into English. The authors compile 2,344 unique Sinhala idioms, proverbs, adages, sayings, and idiosyncratic expressions, drawing 65% of the material from authoritative Sinhala literary sources (such as the Department of Official Languages, the Dictionary of Proverbs of the Sinhalese, and the classic work “Atheetha Wakyā Deepanya”) and the remaining 35% from Wikipedia entries. Each entry is richly annotated with five fields: (1) the type of FoS (categorised into five main groups), (2) a “Literal/Visual Image” that strips away abstract symbolism and records only concrete, imageable elements, (3) the corresponding English FoS when an established equivalent exists, (4) a concise “What it really implies” description that captures the deeper, culturally grounded meaning, and (5) optional additional context for disambiguation. Annotation was performed exclusively by native Sinhala speakers who adhered to strict guidelines that prohibited personal inference; instead, they relied on established dictionaries (Merriam‑Webster, Cambridge) and the source texts themselves. This rigorous process yields a high‑quality, culturally faithful dataset that can serve both as a resource for training translation models and as a benchmark for evaluating cultural awareness in NLP systems.
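The five-field annotation schema could be modeled as a simple record type. The sketch below is illustrative only: the class and field names are assumptions for this summary, not the dataset's actual column names, and the example entry uses an English idiom as a placeholder rather than a real SinFoS row.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SinFoSEntry:
    """One annotated figure of speech (field names are illustrative)."""
    sinhala_text: str                  # the Sinhala expression itself
    fos_type: str                      # one of the five main FoS groups
    literal_visual_image: str          # concrete, imageable elements only
    english_equivalent: Optional[str]  # established English FoS, if one exists
    implied_meaning: str               # the "What it really implies" field
    context: Optional[str] = None      # optional disambiguating context

# Placeholder entry; a real row would carry the Sinhala source text.
entry = SinFoSEntry(
    sinhala_text="...",
    fos_type="idiom",
    literal_visual_image="a pot accusing a kettle of being black",
    english_equivalent="the pot calling the kettle black",
    implied_meaning="criticising a fault one also has",
)
print(entry.english_equivalent is not None)  # True: a mapped equivalent exists
```

Making `english_equivalent` optional mirrors the paper's observation that only some entries have an established English counterpart.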
Beyond dataset creation, the authors develop a binary classifier that distinguishes between two broad categories of FoS (idiom versus proverb) using a BERT‑based architecture. The model attains an accuracy of approximately 92%, demonstrating that the linguistic cues encoded in the annotations are sufficient for reliable automatic type detection. This classifier can be integrated as a preprocessing step in translation pipelines, allowing downstream models to apply type‑specific strategies.
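The type-aware preprocessing idea can be sketched as a routing step. Note the classifier below is a trivial stand-in (a word-count heuristic) purely to show the pipeline shape; the paper's actual model is BERT-based, and the strategy functions here are hypothetical placeholders.

```python
from typing import Callable, Dict

def classify_fos(text: str) -> str:
    """Stand-in for the paper's BERT-based binary classifier.
    Illustrative heuristic only: proverbs tend to be full sentences,
    idioms shorter phrases."""
    return "proverb" if len(text.split()) > 5 else "idiom"

def translate_idiom(text: str) -> str:
    # Placeholder for an idiom-specific translation strategy.
    return f"[idiom strategy] {text}"

def translate_proverb(text: str) -> str:
    # Placeholder for a proverb-specific translation strategy.
    return f"[proverb strategy] {text}"

STRATEGIES: Dict[str, Callable[[str], str]] = {
    "idiom": translate_idiom,
    "proverb": translate_proverb,
}

def translate(text: str) -> str:
    """Route the input through a type-specific strategy, as the paper
    suggests for downstream translation pipelines."""
    return STRATEGIES[classify_fos(text)](text)

print(translate("kick the bucket"))  # routed to the idiom strategy
```

The value of the ~92%-accurate classifier in this design is that the routing decision rarely sends an expression to the wrong strategy.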
The paper also evaluates several state‑of‑the‑art large language models (LLMs), including GPT‑4, LLaMA‑2, and Claude, on the SinFoS test set. Results reveal systematic shortcomings: LLMs frequently produce literal, word‑for‑word translations that miss idiomatic nuance, especially for metaphoric or culturally bound expressions. When the “Literal/Visual Image” diverges from the surface form, error rates increase sharply, indicating that current models lack robust mechanisms for grounding language in cultural context. The authors argue that these findings underscore the need for dedicated cultural‑aware fine‑tuning or retrieval‑augmented prompting to improve performance on low‑resource, culturally rich languages.
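One simple way to quantify the gap described above is exact-match accuracy against the gold English equivalents. The metric below is a minimal sketch under that assumption; the paper's actual evaluation protocol may use different or additional measures.

```python
from typing import List

def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of model outputs matching the gold English FoS,
    case- and whitespace-insensitive. Illustrative metric only."""
    if not predictions:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)

# Toy example: one idiomatic hit, one literal word-for-word miss.
preds = ["the pot calling the kettle black", "a literal word-for-word output"]
refs = ["the pot calling the kettle black", "actions speak louder than words"]
print(exact_match_accuracy(preds, refs))  # 0.5
```

A stricter harness would also compare outputs against the "Literal/Visual Image" field to flag the literal-translation failure mode the authors report.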
In the discussion, the authors highlight both strengths and limitations of SinFoS. Strengths include (a) being the first multi‑label, parallel FoS corpus for Sinhala, (b) providing explicit cultural metadata that facilitates cross‑lingual mapping and cultural transfer research, and (c) enabling type‑aware preprocessing through the binary classifier. Limitations involve (a) incomplete English equivalents (only about two‑thirds of entries have a mapped English FoS), (b) potential subjectivity in the “What it really implies” field despite strict guidelines, and (c) the modest overall size, which may be insufficient for training large‑scale generative models from scratch.
Future work is outlined along several dimensions: expanding the corpus to additional language pairs (e.g., Sinhala‑Chinese, Sinhala‑Arabic), incorporating multimodal annotations (audio recordings, illustrative images) to capture non‑verbal cultural cues, developing multi‑task learning frameworks that jointly perform FoS detection, type classification, and translation, and experimenting with culturally informed prompting techniques to boost LLM performance on idiomatic content.
In conclusion, SinFoS offers a valuable, culturally nuanced benchmark for low‑resource machine translation research. By making the dataset publicly available, the authors provide the community with a concrete tool to investigate and improve the handling of figurative language in multilingual NLP, paving the way for more culturally competent translation systems.