Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

arXiv · October 15, 2025 · Significant research

Summary

This survey analyzes more than 40 benchmarks used to evaluate Arabic large language models, grouping them into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. It finds progress in benchmark diversity but also highlights gaps such as limited temporal evaluation and cultural misalignment. The paper also examines three methods for creating benchmarks: native collection, translation, and synthetic generation. Why it matters: The survey provides a comprehensive reference for Arabic NLP research and offers recommendations for future benchmark development that better aligns with cultural contexts.

Keywords

Arabic LLM · benchmarks · evaluation · datasets · cultural alignment

Related

Large Language Models and Arabic Content: A Review

arXiv · May 12

This study reviews the use of large language models (LLMs) for Arabic language processing, focusing on pre-trained models and their applications. It highlights the challenges in Arabic NLP due to the language's complexity and the relative scarcity of resources. The review also discusses how techniques like fine-tuning and prompt engineering enhance model performance on Arabic benchmarks. Why it matters: This overview helps consolidate research directions and benchmarks in Arabic NLP, guiding future development of LLMs tailored for the Arabic language and its diverse dialects.

The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology

arXiv · Jun 2

This article surveys the landscape of Arabic Large Language Models (ALLMs), tracing their evolution from early text processing systems to sophisticated AI models. It highlights the unique challenges and opportunities in developing ALLMs for the 422 million Arabic speakers across 27 countries. The paper also examines the evaluation of ALLMs through benchmarks and public leaderboards. Why it matters: ALLMs can bridge technological gaps and empower Arabic-speaking communities by catering to their specific linguistic and cultural needs.
