GCC AI Research

Results for "SalamahBench"

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv

The paper introduces SalamahBench, a new benchmark for evaluating the safety of Arabic Language Models (ALMs). The benchmark comprises 8,170 prompts across 12 categories aligned with the MLCommons Safety Hazard Taxonomy. Five state-of-the-art ALMs (Fanar 1, Fanar 2, ALLaM 2, Falcon H1R, and Jais 2) were evaluated using the benchmark. Why it matters: The benchmark enables standardized, category-aware safety evaluation and highlights the need for specialized safeguard mechanisms for robust harm mitigation in ALMs.

LAraBench: Benchmarking Arabic AI with Large Language Models

arXiv

LAraBench introduces a benchmark for Arabic NLP and speech processing, evaluating models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM. The benchmark covers 33 tasks across 61 datasets, using zero-shot and few-shot learning techniques. Results show that task-specific state-of-the-art (SOTA) models generally outperform LLMs in zero-shot settings, though larger LLMs with few-shot learning reduce the gap. Why it matters: This benchmark helps assess and improve the performance of LLMs on Arabic language tasks, highlighting areas where specialized models still excel.

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

arXiv

Researchers from the National Center for AI in Saudi Arabia investigated the sensitivity of Large Language Model (LLM) leaderboards to minor benchmark perturbations. They found that small changes, such as altering the order of the answer choices, can shift rankings by up to 8 positions. The study recommends hybrid scoring, warns against over-reliance on simple benchmark evaluations, and provides code for further research.

From Words to Proverbs: Evaluating LLMs Linguistic and Cultural Competence in Saudi Dialects with Absher

arXiv

This paper introduces Absher, a new benchmark for evaluating LLMs' linguistic and cultural competence in Saudi dialects. The benchmark comprises over 18,000 multiple-choice questions spanning six categories, using dialectal words, phrases, and proverbs from various regions of Saudi Arabia. Evaluation of state-of-the-art LLMs reveals performance gaps, especially in cultural inference and contextual understanding, highlighting the need for dialect-aware training.

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

arXiv

Researchers have introduced LLMeBench, a customizable framework for evaluating large language models (LLMs) across diverse NLP tasks and languages. The framework features generic dataset loaders, multiple model providers, and pre-implemented evaluation metrics, supporting in-context learning with zero- and few-shot settings. LLMeBench was tested on 31 unique NLP tasks using 53 datasets across 90 experimental setups with 296K data points, and the code has been open-sourced. Why it matters: The framework's flexibility and ease of customization should accelerate LLM benchmarking, especially for Arabic and other low-resource languages.