AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

arXiv · September 4, 2025 · Significant research

Summary

The paper introduces AraHalluEval, a framework for evaluating hallucinations in Arabic and multilingual large language models (LLMs). It defines 12 fine-grained hallucination indicators across two tasks, generative question answering and summarization, and applies them to 12 LLMs spanning Arabic-specific, multilingual, and reasoning-based models. Factual hallucinations prove more common than faithfulness errors, and the Arabic model Allam shows lower hallucination rates than the multilingual models. Why it matters: This work addresses a gap in Arabic NLP by providing a fine-grained tool for assessing hallucination in LLMs, a prerequisite for mitigating it and for building reliable AI applications in the Arabic-speaking world.

Keywords

hallucination · Arabic LLMs · evaluation framework · AraHalluEval · natural language generation


Related

AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic

arXiv · Mar 14

The paper introduces AraTrust, a benchmark for evaluating the trustworthiness of LLMs when prompted in Arabic. The benchmark contains 522 multiple-choice questions covering dimensions such as truthfulness, ethics, safety, and fairness. In experiments with AraTrust, GPT-4 performed best, while open-source models such as AceGPT 7B and Jais 13B scored lower. Why it matters: The benchmark addresses a gap in evaluating LLMs for Arabic, which is essential for the safe and ethical deployment of AI in the Arab world.

AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

arXiv · Jun 10

This paper benchmarks reasoning-focused LLMs, especially DeepSeek models, on fifteen Arabic NLP tasks using zero-shot, few-shot, and fine-tuning strategies. Key findings: three in-context examples improve F1 scores by over 13 points on classification tasks; DeepSeek models outperform GPT o4-mini by 12 F1 points on complex inference tasks in the zero-shot setting; and LoRA fine-tuning yields up to an additional 8 points in F1 and BLEU. Why it matters: The systematic evaluation shows which strategies most improve LLM performance on Arabic NLP tasks, informing the development of more capable Arabic language models.
