MBZUAI researchers have released ALM Bench, a benchmark dataset for evaluating multimodal LLMs on cultural visual question-answering tasks across 100 languages. The dataset includes over 22,000 question-answer pairs across 19 categories, with a focus on low-resource languages and cultural nuances, including three Arabic dialects. The team tested 16 open- and closed-source multimodal LLMs on it, and the results point to a significant need for greater cultural and linguistic inclusivity. Why it matters: The benchmark aims to improve the inclusivity of multimodal AI systems by addressing the underrepresentation of low-resource languages and cultural contexts.
The paper introduces SalamahBench, a new benchmark for evaluating the safety of Arabic Language Models (ALMs). The benchmark comprises 8,170 prompts across 12 categories aligned with the MLCommons Safety Hazard Taxonomy. Five state-of-the-art ALMs (Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2) were evaluated using the benchmark. Why it matters: The benchmark enables standardized, category-aware safety evaluation and highlights the need for specialized safeguard mechanisms for robust harm mitigation in ALMs.
Researchers have introduced LLMeBench, a customizable framework for evaluating large language models (LLMs) across diverse NLP tasks and languages. The framework features generic dataset loaders, multiple model providers, and pre-implemented evaluation metrics, supporting in-context learning with zero- and few-shot settings. LLMeBench was tested on 31 unique NLP tasks using 53 datasets across 90 experimental setups with 296K data points, and the code has been open-sourced. Why it matters: The framework's flexibility and ease of customization should accelerate LLM benchmarking, especially for Arabic and other low-resource languages.
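To make the zero- versus few-shot distinction concrete, here is a minimal sketch of how an in-context-learning prompt might be assembled. The function name `build_prompt` and the `Input:`/`Output:` template are hypothetical and chosen for illustration; they are not LLMeBench's API or its prompt format.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    query: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt; with no examples it is zero-shot, otherwise few-shot.

    Hypothetical helper for illustration only (not part of LLMeBench).
    """
    parts = [instruction]
    # Each labeled demonstration becomes one in-context example.
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The query is left with an open "Output:" for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```

Calling `build_prompt(instruction, query)` gives the zero-shot form, and passing a list of `(text, label)` pairs as `examples` gives the few-shot form of the same task.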
The paper introduces ArabicNumBench, a benchmark for evaluating LLMs on Arabic number reading using both Eastern and Western Arabic numerals. It evaluates 71 models from 10 providers on 210 number reading tasks, using zero-shot, zero-shot CoT, few-shot, and few-shot CoT prompting strategies. The results show substantial performance variation, with few-shot CoT prompting achieving 2.8x higher accuracy than zero-shot approaches. Why it matters: The benchmark establishes baselines for Arabic number comprehension and provides guidance for model selection in production Arabic NLP systems.
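As an illustration of the two numeral systems involved, the sketch below maps between Eastern Arabic digits (Unicode U+0660 to U+0669) and Western Arabic digits (0 to 9). The helper names `to_western` and `to_eastern` are hypothetical and are not taken from the benchmark; the sketch covers only digit conversion, not the number reading tasks themselves.

```python
# Eastern Arabic digits occupy the code points U+0660..U+0669 (٠ through ٩).
EASTERN = "".join(chr(0x0660 + i) for i in range(10))
WESTERN = "0123456789"

# Translation tables for str.translate, one per direction.
_TO_WESTERN = str.maketrans(EASTERN, WESTERN)
_TO_EASTERN = str.maketrans(WESTERN, EASTERN)


def to_western(text: str) -> str:
    """Replace Eastern Arabic digits with Western ones; other characters are unchanged."""
    return text.translate(_TO_WESTERN)


def to_eastern(text: str) -> str:
    """Replace Western digits with Eastern Arabic ones; other characters are unchanged."""
    return text.translate(_TO_EASTERN)
```

Python's built-in `int()` already accepts Unicode decimal digits, so an explicit mapping like this is mainly useful for normalizing text before comparing model outputs.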
The Open Arabic LLM Leaderboard (OALL) has been launched to benchmark Arabic language models, addressing the gap in resources for non-English NLP. It incorporates datasets like AlGhafa, ACVA, and translated versions of MMLU and EXAMS from the AceGPT suite. The leaderboard is built around Hugging Face’s LightEval framework and scores tasks with normalized log-likelihood accuracy. Why it matters: This initiative promotes research and development in Arabic NLP, serving over 380 million Arabic speakers by improving how Arabic LLMs are evaluated and refined.
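The sketch below illustrates the general idea behind normalized log-likelihood accuracy on multiple-choice tasks: each answer option's log-likelihood is divided by its length before the highest-scoring option is picked, so longer options are not penalized simply for having more tokens. Normalizing by character count is one common convention and is an assumption here; the function names are hypothetical, and this is not the leaderboard's or LightEval's actual code.

```python
from typing import Iterable, Sequence, Tuple


def pick_normalized(choices: Sequence[str], logprobs: Sequence[float]) -> int:
    """Return the index of the choice with the highest length-normalized log-likelihood.

    Assumption: normalization divides by the character length of each choice.
    """
    if not choices or len(choices) != len(logprobs):
        raise ValueError("choices and logprobs must be non-empty and the same length")
    scores = [lp / max(len(choice), 1) for choice, lp in zip(choices, logprobs)]
    return max(range(len(scores)), key=scores.__getitem__)


def normalized_accuracy(
    items: Iterable[Tuple[Sequence[str], Sequence[float], int]],
) -> float:
    """Fraction of items whose normalized pick matches the gold index."""
    items = list(items)
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(
        pick_normalized(choices, logprobs) == gold
        for choices, logprobs, gold in items
    )
    return correct / len(items)
```

With the options `"ab"` and `"abcdefgh"` and log-likelihoods of -3.0 and -6.0, the raw scores favor the short option, while the normalized scores (-1.5 versus -0.75) favor the longer one.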
The paper introduces ALLaM, a series of large language models for Arabic and English, designed to support Arabic Language Technologies. The models are trained with language alignment and knowledge transfer in mind, using a decoder-only architecture. ALLaM achieves state-of-the-art results on Arabic benchmarks like MMLU Arabic and Arabic Exams. Why it matters: This work advances Arabic NLP by providing high-performing LLMs and demonstrating effective techniques for cross-lingual transfer learning and alignment with human preferences.