Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

Samir Abdaljalil, Parichit Sharma, Erchin Serpedin et al.

February 06, 2026 Score: 8.0

Interest Score Breakdown

Seismic Impact (30%)

9.0/10

Industry-wide significance

Ecosystem Relevance (70%)

7.0/10

Applicable to your apps

Abstract

Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (https://huggingface.co/datasets/sabdalja/HalluVerse-M3).
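
The abstract reports detection accuracy broken down by language, task, and hallucination type. A minimal sketch of that kind of breakdown is given here in Python. The record layout (the `language`, `task`, `gold`, and `predicted` fields), the `accuracy_by` helper, and the toy records are all assumptions made for illustration; they are not the dataset's actual schema, the authors' evaluation code, or benchmark results.

```python
from collections import defaultdict

# The three hallucination categories named in the abstract.
# The record layout used below is hypothetical, not the dataset's schema.
LABELS = ("entity", "relation", "sentence")


def accuracy_by(records, key):
    """Group records by records[i][key] and return {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        if r["predicted"] == r["gold"]:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Toy records invented purely for illustration (not benchmark data).
records = [
    {"language": "en", "task": "qa", "gold": "entity", "predicted": "entity"},
    {"language": "en", "task": "summarization", "gold": "sentence", "predicted": "relation"},
    {"language": "hi", "task": "qa", "gold": "relation", "predicted": "relation"},
    {"language": "hi", "task": "summarization", "gold": "sentence", "predicted": "entity"},
]

by_language = accuracy_by(records, "language")
by_task = accuracy_by(records, "task")
by_type = accuracy_by(records, "gold")  # accuracy per gold hallucination type

for name, table in (("language", by_language), ("task", by_task), ("type", by_type)):
    print(name, table)
```

Grouping on one key at a time keeps the sketch short; a two-way breakdown (for example language by task) could pass a tuple-valued field through the same function.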

Source

arXiv ID: 2602.06920

