A new standard for evaluating Arabic language models presented at ACL

MBZUAI · Significant research

Summary

MBZUAI researchers have created ArabicMMLU, the first benchmark dataset in Modern Standard Arabic for evaluating language understanding across multiple tasks. The dataset contains over 14,000 multiple-choice questions from school exams across the Arabic-speaking world and addresses the limitations of translated English datasets. It was presented at the 62nd Annual Meeting of the Association for Computational Linguistics in Bangkok. Why it matters: This benchmark enables a more accurate and culturally relevant evaluation of LLMs' capabilities in Arabic, which is crucial for developing AI tailored to the Arab world.

Keywords

ArabicMMLU · MBZUAI · Arabic NLP · benchmark dataset · LLM evaluation

Related

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

arXiv · Oct 15

This survey paper analyzes over 40 benchmarks used to evaluate Arabic large language models, categorizing them into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. It identifies progress in benchmark diversity but also highlights gaps like limited temporal evaluation and cultural misalignment. The paper also examines methods for creating benchmarks, including native collection, translation, and synthetic generation. Why it matters: The survey provides a comprehensive reference for Arabic NLP research and offers recommendations for future benchmark development to better align with cultural contexts.

LAraBench: Benchmarking Arabic AI with Large Language Models

arXiv · May 24

LAraBench introduces a benchmark for Arabic NLP and speech processing, evaluating LLMs such as GPT-3.5-turbo, GPT-4, BLOOMZ, and Jais-13b-chat alongside the speech models Whisper and USM. The benchmark covers 33 tasks across 61 datasets in zero-shot and few-shot settings. Results show that task-specific state-of-the-art (SOTA) models generally outperform LLMs in zero-shot settings, though larger LLMs with few-shot learning narrow the gap. Why it matters: This benchmark helps assess and improve the performance of LLMs on Arabic language tasks, highlighting areas where specialized models still excel.
