Skip to content
GCC AI Research

Search

Results for "ArabicMMLU"

A new standard for evaluating Arabic language models presented at ACL

MBZUAI ·

MBZUAI researchers have created ArabicMMLU, the first benchmark dataset in Modern Standard Arabic for evaluating language understanding across multiple tasks. The dataset contains over 14,000 multiple-choice questions from school exams across the Arabic-speaking world and addresses the limitations of translated English datasets. It was presented at the 62nd Annual Meeting of the Association for Computational Linguistics in Bangkok. Why it matters: This benchmark enables a more accurate and culturally relevant evaluation of LLMs' capabilities in Arabic, which is crucial for developing AI tailored to the Arab world.

ALLaM: Large Language Models for Arabic and English

arXiv ·

The paper introduces ALLaM, a series of large language models for Arabic and English, designed to support Arabic Language Technologies. The models are trained with language alignment and knowledge transfer in mind, using a decoder-only architecture. ALLaM achieves state-of-the-art results on Arabic benchmarks like MMLU Arabic and Arabic Exams. Why it matters: This work advances Arabic NLP by providing high-performing LLMs and demonstrating effective techniques for cross-lingual transfer learning and alignment with human preferences.

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

arXiv ·

MBZUAI researchers introduce ARB, the first comprehensive benchmark for evaluating step-by-step multimodal reasoning in Arabic across textual and visual modalities. The benchmark spans 11 diverse domains and includes 1,356 multimodal samples with 5,119 human-curated reasoning steps. Evaluations of 12 state-of-the-art LMMs revealed challenges in coherence, faithfulness, and cultural grounding, highlighting the need for culturally aware AI systems.

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv ·

The paper introduces SalamahBench, a new benchmark for evaluating the safety of Arabic Language Models (ALMs). The benchmark comprises 8,170 prompts across 12 categories aligned with the MLCommons Safety Hazard Taxonomy. Five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, were evaluated using the benchmark. Why it matters: The benchmark enables standardized, category-aware safety evaluation, highlighting the necessity of specialized safeguard mechanisms for robust harm mitigation in ALMs.