Sinhala Physical Common Sense Reasoning Dataset for Global PIQA

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This paper presents the first Sinhala physical commonsense reasoning dataset, created as part of Global PIQA. It contains 110 human-created and verified samples, each consisting of a prompt, the corresponding correct answer, and a wrong answer. Most of the questions refer to the Sri Lankan context, where Sinhala is an official language.


💡 Research Summary

This paper introduces the first Sinhala‑language physical commonsense reasoning dataset created for the Global PIQA initiative. The dataset consists of 110 human‑authored and verified items, each comprising a prompt (question), a correct answer, and an incorrect answer. Two native Sinhala researchers, both holding PhDs in NLP, authored at least 50 items each and cross‑validated each other's work, ensuring high linguistic fidelity.

The items are split into two formats: sentence/paragraph completion (67 items) and direct question‑answer pairs (43 items). They cover nine thematic domains—Buddhism, literature, mythology, sports and games, food, farming/agri/fishery, proverbs, history, and “other” (general daily‑life reasoning). The “other” category includes both universally applicable physical reasoning (e.g., opening a tight bottle cap) and Sri Lankan‑specific cultural knowledge (e.g., rituals in Sinhala Buddhist weddings, the local version of hopscotch called “batto paneema”).
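Each item thus pairs a prompt with one correct and one minimally different incorrect answer, tagged by format and domain. A minimal sketch of such a record in Python (the field names and layout are my own assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PiqaItem:
    """One hypothetical Global PIQA item; field names are illustrative."""
    prompt: str    # sentence/paragraph stem or a direct question
    correct: str   # human-verified correct answer
    wrong: str     # minimally perturbed incorrect answer
    fmt: str       # "completion" or "qa"
    domain: str    # one of the nine thematic domains, e.g. "food"
```

A structure like this makes the two evaluation formats and nine domains easy to filter when computing per-category accuracy.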

Incorrect answers were generated by (1) altering a single character, (2) substituting one to three words, or (3) swapping words or phrases. This design forces models to detect subtle semantic changes rather than rely on superficial cues. Token‑length analysis using the SinLLaMA tokenizer shows that correct and incorrect answers have nearly identical length distributions (average 6‑8 tokens), ruling out length‑based shortcuts. A comparison with a general Sinhala word‑frequency corpus yields a Pearson correlation of only 0.11, confirming that the dataset is domain‑specific and not simply a sample of everyday language.
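The word-frequency comparison above can be sketched as a Pearson correlation over the shared vocabulary of the dataset and a reference corpus. A minimal pure-Python version (the tokenization and helper names are assumptions, not the authors' code):

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def frequency_correlation(dataset_tokens, corpus_tokens):
    """Correlate word frequencies over the combined vocabulary.

    A low value (like the paper's 0.11) suggests the dataset's
    vocabulary diverges from everyday corpus usage.
    """
    d, c = Counter(dataset_tokens), Counter(corpus_tokens)
    vocab = sorted(set(d) | set(c))
    return pearson([d[w] for w in vocab], [c[w] for w in vocab])
```

In practice both token streams would come from the same tokenizer (here, SinLLaMA) so that frequencies are comparable.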

For evaluation, the authors performed zero‑shot experiments with two models: SinBERT (a Sinhala‑focused BERT variant) and GPT‑5 mini (a multilingual large language model accessed via its public GUI). SinBERT achieved an overall accuracy of 49.09 %, performing better on paragraph‑completion items (≈57 %) than on direct Q‑A items, but its performance dropped sharply on culturally dense questions (e.g., Buddhist ceremony details). GPT‑5 mini attained a higher overall accuracy of 64.5 % but relied on an internal translation step from Sinhala to English before reasoning. Translation errors both harmed and helped: the term “bath kolaya” (a thin polythene rice wrapper) was mistranslated as “banana leaf,” leading to a wrong answer; conversely, “dan” (a native wild fruit) was rendered as “jackfruit seed,” yet the model still produced the correct answer because the mistranslation inadvertently matched the intended reasoning.
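A zero-shot evaluation of this two-choice format reduces to scoring both candidate answers and checking whether the model prefers the correct one. A minimal sketch assuming a generic scoring interface (the `score` callable and dictionary keys are illustrative, not the authors' harness):

```python
def evaluate(items, score):
    """Fraction of items where the model prefers the correct answer.

    `score(prompt, answer)` returns a number where higher means the
    model finds the answer more plausible (e.g., a log-likelihood).
    Each item is a dict with "prompt", "correct", and "wrong" keys.
    """
    hits = 0
    for item in items:
        if score(item["prompt"], item["correct"]) > score(item["prompt"], item["wrong"]):
            hits += 1
    return hits / len(items)
```

With 110 items, an accuracy of 49.09 % corresponds to 54 correct choices, essentially chance level for a two-way task.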

A further analysis examined the minimum edit distance between the two answer choices. Most pairs differ by only 1‑4 token edits, and model accuracy varies with edit distance, suggesting that current models remain sensitive to even minimal textual perturbations between the choices.
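The minimum edit distance between answer pairs can be computed with the standard Levenshtein dynamic program over token sequences; a minimal sketch (the tokens would come from a Sinhala tokenizer such as SinLLaMA's, which is an assumption here):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ta != tb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]
```

Bucketing items by `edit_distance(correct_tokens, wrong_tokens)` and averaging accuracy per bucket reproduces the kind of analysis described above.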

The authors acknowledge limitations: both creators are Sinhala Buddhists, introducing a cultural bias toward Buddhist contexts; minor spelling errors may persist due to the lack of a production‑grade Sinhala spell‑checker; and GPT‑5 mini was evaluated only once, so stochastic variation could affect results.

In conclusion, the paper delivers a valuable, culturally grounded resource for evaluating physical commonsense reasoning in a low‑resource language. The experimental results highlight that state‑of‑the‑art LLMs still struggle with non‑English, culturally specific content, especially when forced to rely on translation pipelines. Future work should expand the dataset, explore multilingual models that reason directly in Sinhala without translation, and develop methods to mitigate cultural bias and spelling inconsistencies.

