GCC AI Research

A new standard for evaluating Arabic language models presented at ACL

MBZUAI · Significant research

Summary

MBZUAI researchers have created ArabicMMLU, the first benchmark dataset in Modern Standard Arabic for evaluating language understanding across multiple tasks. The dataset contains over 14,000 multiple-choice questions from school exams across the Arabic-speaking world and addresses the limitations of translated English datasets. It was presented at the 62nd Annual Meeting of the Association for Computational Linguistics in Bangkok. Why it matters: This benchmark enables a more accurate and culturally relevant evaluation of LLMs' capabilities in Arabic, which is crucial for developing AI tailored to the Arab world.


Related

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

arXiv

This survey paper analyzes over 40 benchmarks used to evaluate Arabic large language models, categorizing them into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. It identifies progress in benchmark diversity but also highlights gaps like limited temporal evaluation and cultural misalignment. The paper also examines methods for creating benchmarks, including native collection, translation, and synthetic generation. Why it matters: The survey provides a comprehensive reference for Arabic NLP research and offers recommendations for future benchmark development to better align with cultural contexts.

LAraBench: Benchmarking Arabic AI with Large Language Models

arXiv

LAraBench introduces a benchmark for Arabic NLP and speech processing, evaluating models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM. The benchmark covers 33 tasks across 61 datasets, using zero-shot and few-shot learning techniques. Results show that specialized state-of-the-art (SOTA) models generally outperform LLMs evaluated in zero-shot settings, though larger LLMs with few-shot learning narrow the gap. Why it matters: This benchmark helps assess and improve the performance of LLMs on Arabic language tasks, highlighting areas where specialized models still excel.