Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (https://huggingface.co/datasets/sabdalja/HalluVerse-M3).
💡 Research Summary
The paper introduces Halluverse‑M³, a new benchmark designed to study hallucinations in large language models (LLMs) across multiple languages and generation tasks. Existing hallucination datasets are largely English‑centric, treat hallucination as a binary label, and focus on a single task, which limits systematic cross‑lingual and cross‑task analysis. Halluverse‑M³ fills this gap by covering four typologically diverse languages—English, Arabic, Hindi, and Turkish—and two generation tasks: factual question answering (QA) and dialogue summarization.
Each data point consists of a human‑verified reference output (answer or summary) and a deliberately hallucinated version created through a controlled editing process. The hallucinations are categorized into three fine‑grained types: (1) Entity‑level errors, where a named entity is swapped with an incorrect but plausible one; (2) Relation‑level errors, where the predicate or attributes linking correct entities are altered; and (3) Sentence‑level errors, where an entirely unsupported statement is inserted. Only a single hallucination is injected per instance, ensuring clear attribution of errors.
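An instance with this structure can be pictured as a simple record. The field names below are illustrative assumptions for exposition, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Illustrative schema for one benchmark instance (field names are
# assumptions, not the dataset's actual column names).
@dataclass
class HalluInstance:
    language: str            # "en", "ar", "hi", or "tr"
    task: str                # "qa" or "summarization"
    reference: str           # human-verified answer or summary
    hallucinated: str        # edited version with exactly one hallucination
    hallucination_type: str  # "entity", "relation", or "sentence"

example = HalluInstance(
    language="en",
    task="qa",
    reference="Marie Curie won two Nobel Prizes.",
    hallucinated="Marie Curie won three Nobel Prizes.",
    hallucination_type="relation",  # the attribute linking the entities is altered
)
```

Because each instance carries exactly one injected error, the type label applies to the whole pair rather than to individual spans.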
The construction pipeline proceeds as follows: (i) source QA data are taken from Lin et al. (2022) and summarization data from DialogSum; (ii) non‑English versions are generated via Google Translate at the sentence or summary level, then manually vetted by native speakers for grammaticality and semantic fidelity; (iii) hallucinated outputs are produced automatically by prompting an LLM with a structured instruction that specifies the desired hallucination type and provides examples, guaranteeing consistency across languages; (iv) two native annotators per language independently label each pair, achieving substantial inter‑annotator agreement (Cohen’s κ ≈ 0.83 overall, 0.74–0.79 per language). After filtering, the final corpus contains 4,038 instances (2,885 QA, 1,153 summarization) with balanced distribution across languages and hallucination types.
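The reported inter-annotator agreement uses Cohen's κ, which discounts the agreement two annotators would reach by chance given their label distributions. A minimal pure-Python version (the toy labels below are made up for illustration):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["entity", "entity", "relation", "sentence"]
b = ["entity", "relation", "relation", "sentence"]
print(round(cohen_kappa(a, b), 3))  # → 0.636
```

Values around 0.74–0.83, as reported for this dataset, fall in the conventional "substantial" to "almost perfect" agreement bands.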
Formally, the authors model a text y as a set of atomic propositions P(y). Reference‑consistency is defined as P(y) ⊆ P(y*). A hallucination corresponds to a proposition p⁺ ∈ P(ỹ) \ P(y*) that is not aligned with any reference proposition. Alignment (Align) determines whether p⁺ shares the same fact slot as a reference proposition, which in turn defines the three hallucination categories. This mathematical framing enables a structured prediction task: given (y*, ỹ), a detector fθ must output the hallucination type h ∈ {ENTITY, RELATION, SENTENCE}.
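Treating propositions as (subject, relation, object) triples, the typing rule can be sketched as a small decision procedure. This is one plausible reading of the Align definition, not the authors' implementation:

```python
def hallucination_type(ref_props, hyp_props):
    """Classify the first unsupported proposition in hyp_props.

    Propositions are (subject, relation, object) triples; a proposition
    is hallucinated if it is absent from the reference set ref_props.
    """
    for s, r, o in hyp_props:
        if (s, r, o) in ref_props:
            continue  # supported by the reference
        # Aligned on (subject, relation) with a different object: swapped entity.
        if any(rs == s and rr == r for rs, rr, _ in ref_props):
            return "entity"
        # Same entities, different predicate: altered relation.
        if any(rs == s and ro == o for rs, _, ro in ref_props):
            return "relation"
        # No aligned reference proposition at all: unsupported content.
        return "sentence"
    return None  # no hallucination found

ref = {("curie", "won", "nobel_prize"), ("curie", "born_in", "warsaw")}
print(hallucination_type(ref, [("curie", "won", "oscar")]))            # entity
print(hallucination_type(ref, [("curie", "refused", "nobel_prize")]))  # relation
print(hallucination_type(ref, [("einstein", "played", "violin")]))     # sentence
```

The "fact slot" shared with a reference proposition determines the category: a matching slot with one substituted element yields an entity- or relation-level label, while no match at all yields sentence-level.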
The benchmark is used to evaluate seven contemporary LLMs, including open‑source models (Llama 2‑13B, Mistral‑7B, Falcon‑40B) and proprietary APIs (GPT‑3.5‑Turbo, GPT‑4). Detection is framed as a multi‑class classification problem, and performance is reported using accuracy, macro‑averaged F1, and per‑type F1. Results reveal several consistent patterns:
- Task effect – QA is substantially easier than summarization. Across models, QA macro‑F1 ranges from 0.78 to 0.91, while summarization falls between 0.61 and 0.73. The higher difficulty of summarization stems from longer context and higher abstraction, which obscure the grounding of specific facts.
- Language effect – English consistently yields the highest detection scores (≈ 0.86–0.91 macro‑F1). Performance degrades for Arabic (≈ 0.78–0.84), Turkish (≈ 0.73–0.80), and drops sharply for Hindi (≈ 0.61–0.68). This gradient mirrors the resource availability and pre‑training data volume for each language, highlighting the need for more multilingual pre‑training.
- Hallucination type effect – Entity‑level errors are the easiest to detect (average F1 ≈ 0.80), followed by relation‑level (≈ 0.71), while sentence‑level hallucinations are the hardest (≈ 0.58). Sentence‑level errors often involve completely novel content, which current detectors miss because they rely heavily on surface similarity and token‑level alignment rather than deeper world‑knowledge verification.
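The macro-averaged F1 figures above weight the three hallucination types equally, so a model cannot score well by excelling only on the easy entity class. A minimal pure-Python version of the metric (toy labels for illustration):

```python
def macro_f1(y_true, y_pred, classes=("entity", "relation", "sentence")):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is absent.
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

gold = ["entity", "entity", "relation", "sentence"]
pred = ["entity", "relation", "relation", "sentence"]
print(round(macro_f1(gold, pred), 3))  # → 0.778
```

In this toy example a single entity/relation confusion already pulls macro-F1 well below accuracy (0.75 vs. per-class-balanced 0.778 here by coincidence of the small sample), which is why the per-type F1 breakdown is the more diagnostic view.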
Error analysis shows that many failures arise from (a) lexical paraphrasing that masks the altered entity or relation, (b) translation artifacts that unintentionally introduce or remove factual cues, and (c) model bias toward generating plausible‑sounding but unsupported statements, especially in low‑resource languages.
The authors acknowledge several limitations: the reliance on machine translation may propagate subtle meaning shifts; the automatic hallucination generation uses a single LLM, potentially imprinting its own biases into the dataset; and the benchmark currently covers only four languages and two tasks, leaving out many real‑world scenarios such as code generation, medical report generation, or multimodal outputs.
Future work is outlined as expanding the language set (including African and Indigenous languages), adding domain‑specific tasks (e.g., legal reasoning, scientific summarization), integrating human‑in‑the‑loop feedback for hallucination mitigation, and exploring contrastive learning or retrieval‑augmented methods to improve detection, especially for sentence‑level errors.
In summary, Halluverse‑M³ provides a realistic, fine‑grained, and multilingual benchmark that enables systematic evaluation of hallucination detection across tasks and languages. By releasing the dataset, code, and annotation guidelines on HuggingFace, the authors invite the community to develop more robust, language‑agnostic detection and mitigation strategies, moving LLMs closer to trustworthy real‑world deployment.