Skip to content
GCC AI Research

Search

Results for "Evaluation"

Evaluating Models and their Explanations

MBZUAI ·

This article discusses the increasing concerns about the interpretability of large deep learning models. It highlights a talk by Danish Pruthi, an Assistant Professor at the Indian Institute of Science (IISc), Bangalore, who presented a framework to quantify the value of explanations and the need for holistic model evaluation. Pruthi's talk touched on geographically representative artifacts from text-to-image models and how well conversational LLMs challenge false assumptions. Why it matters: Addressing interpretability and evaluation is crucial for building trustworthy and reliable AI systems, particularly in sensitive applications within the Middle East and globally.

UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

arXiv ·

This paper presents a UI-level evaluation of ALLaM-34B, an Arabic-centric LLM developed by SDAIA and deployed in the HUMAIN Chat service. The evaluation used a prompt pack spanning various Arabic dialects, code-switching, reasoning, and safety, with outputs scored by frontier LLM judges. Results indicate strong performance in generation, code-switching, MSA handling, reasoning, and improved dialect fidelity, positioning ALLaM-34B as a robust Arabic LLM suitable for real-world use.

From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

arXiv ·

This paper introduces a novel evaluation framework for Arabic language models, addressing gaps in linguistic accuracy and cultural alignment. The authors analyze existing datasets and present the Arabic Depth Mini Dataset (ADMD), a curated collection of 490 questions across ten domains. Evaluating GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max using ADMD reveals performance variations, with Claude 3.5 Sonnet achieving the highest accuracy at 30%. Why it matters: The work emphasizes the importance of cultural competence in Arabic language model evaluation, providing practical insights for improvement.

GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP

arXiv ·

This paper presents a comprehensive evaluation of ChatGPT's performance across 44 Arabic NLP tasks using over 60 datasets. The study compares ChatGPT's capabilities in Modern Standard Arabic (MSA) and Dialectal Arabic (DA) against smaller, fine-tuned models. Results show ChatGPT is outperformed by smaller, fine-tuned models and exhibits limitations in handling Arabic dialects compared to MSA. Why it matters: The work highlights the need for further research and development of Arabic-specific NLP models to overcome the limitations of general-purpose models like ChatGPT.

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

arXiv ·

The paper introduces Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. Ara-HOPE includes a five-category error taxonomy and a decision-tree annotation protocol designed to address the challenges of dialect-specific MT errors. Evaluation of Jais, GPT-3.5, and NLLB-200 shows dialect-specific terminology and semantic preservation remain key challenges. Why it matters: The new framework and public dataset will help improve the evaluation and development of dialect-aware MT systems for Arabic.

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

arXiv ·

Researchers from the National Center for AI in Saudi Arabia investigated the sensitivity of Large Language Model (LLM) leaderboards to minor benchmark perturbations. They found that small changes, like choice order, can shift rankings by up to 8 positions. The study recommends hybrid scoring and warns against over-reliance on simple benchmark evaluations, providing code for further research.