GCC AI Research

Results for "evaluation framework"

Evaluating Models and their Explanations

MBZUAI

This article discusses growing concerns about the interpretability of large deep learning models. It highlights a talk by Danish Pruthi, an Assistant Professor at the Indian Institute of Science (IISc), Bangalore, who presented a framework for quantifying the value of explanations and argued for more holistic model evaluation. Pruthi's talk also covered whether text-to-image models produce geographically representative artifacts and how well conversational LLMs challenge false assumptions. Why it matters: Addressing interpretability and evaluation is crucial for building trustworthy and reliable AI systems, particularly in sensitive applications within the Middle East and globally.
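
As a rough illustration of the core idea behind such a framework, the sketch below scores an explanation by how much it helps a "student" predict a "teacher" model's outputs, a simulation-style metric in the spirit of Pruthi's work. All names and the toy predictors are illustrative assumptions, not code from the talk.

```python
# A minimal sketch of a simulation-style metric for explanation value:
# an explanation is worth something if it helps a "student" predict the
# "teacher" model's outputs. Names and toy predictors are illustrative.

def simulation_gain(student_with_expl, student_without_expl, teacher, inputs):
    """Return the agreement gain attributable to explanations."""
    inputs = list(inputs)

    def agreement(student):
        matches = sum(student(x) == teacher(x) for x in inputs)
        return matches / len(inputs)

    return agreement(student_with_expl) - agreement(student_without_expl)

# Toy usage: the student that saw explanations tracks the teacher better.
teacher = lambda x: x % 2                         # model being explained
student_blind = lambda x: 0                       # student without explanations
student_helped = lambda x: x % 2 if x < 8 else 0  # student aided by explanations
print(simulation_gain(student_helped, student_blind, teacher, range(10)))  # 0.4
```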

From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

arXiv

This paper introduces a novel evaluation framework for Arabic language models, addressing gaps in linguistic accuracy and cultural alignment. The authors analyze existing datasets and present the Arabic Depth Mini Dataset (ADMD), a curated collection of 490 questions across ten domains. Evaluating GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max using ADMD reveals performance variations, with Claude 3.5 Sonnet achieving the highest accuracy at 30%. Why it matters: The work emphasizes the importance of cultural competence in Arabic language model evaluation, providing practical insights for improvement.
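
For a concrete picture of the evaluation, here is a minimal sketch of the kind of per-domain exact-match accuracy computation the ADMD results imply. The data layout and the `query_model` callable are assumptions for illustration, not the paper's released code.

```python
# A minimal sketch of per-domain exact-match accuracy over a curated
# question set, in the spirit of the ADMD evaluation described above.
# The data layout and the `query_model` callable are assumed for
# illustration; this is not the paper's released code.

from collections import defaultdict

def evaluate(model_name, dataset, query_model):
    """dataset: iterable of dicts with 'domain', 'question', 'answer' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in dataset:
        prediction = query_model(model_name, item["question"])
        total[item["domain"]] += 1
        if prediction.strip() == item["answer"].strip():
            correct[item["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}
```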

Auto-assessment of assessment: A conceptual framework towards fulfilling the policy gaps in academic assessment practices

arXiv

This paper introduces an AI framework for autonomous assessment of student work, addressing policy gaps in academic practices. A survey of 117 academics from the UK, UAE, and Iraq reveals positive attitudes toward AI in education, particularly for autonomous assessment. The study also highlights a lack of awareness of modern AI tools among experienced academics, emphasizing the need for updated policies and training.

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

arXiv

Researchers have introduced LLMeBench, a customizable framework for evaluating large language models (LLMs) across diverse NLP tasks and languages. The framework features generic dataset loaders, multiple model providers, and pre-implemented evaluation metrics, supporting in-context learning with zero- and few-shot settings. LLMeBench was tested on 31 unique NLP tasks using 53 datasets across 90 experimental setups with 296K data points, and the code has been open-sourced. Why it matters: The framework's flexibility and ease of customization should accelerate LLM benchmarking, especially for Arabic and other low-resource languages.
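
The summary above describes a configuration-driven pipeline: a dataset loader, a model provider, a prompt template, and a metric. The sketch below shows that kind of zero-shot evaluation loop in miniature; the names are hypothetical and do not reflect LLMeBench's actual API.

```python
# A minimal sketch of the zero-shot evaluation loop a framework like
# LLMeBench automates: a dataset loader, a model provider, a prompt
# template, and a metric wired together. Names here are hypothetical.

def run_benchmark(load_dataset, call_model, make_prompt, metric):
    gold, predictions = [], []
    for sample in load_dataset():               # generic dataset loader
        prompt = make_prompt(sample["input"])   # zero-shot: no demonstrations
        predictions.append(call_model(prompt))  # pluggable model provider
        gold.append(sample["label"])
    return metric(gold, predictions)            # pre-implemented metric

# Toy usage: a sentiment-style task with stand-in components.
data = [{"input": "great", "label": "pos"}, {"input": "bad", "label": "neg"}]
accuracy = run_benchmark(
    load_dataset=lambda: data,
    call_model=lambda p: "pos" if "great" in p else "neg",
    make_prompt=lambda text: f"Classify the sentiment: {text}",
    metric=lambda g, p: sum(a == b for a, b in zip(g, p)) / len(g),
)
print(accuracy)  # 1.0 on this toy pair
```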

OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

arXiv

MBZUAI researchers release OpenFactCheck, a unified framework to evaluate the factual accuracy of large language models. The framework includes modules for response evaluation, LLM evaluation, and fact-checker evaluation. OpenFactCheck is available as an open-source Python library and a web service, with the code hosted on GitHub.
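
A minimal sketch of how those three modules might fit together is shown below. The function names and the naive claim splitting are illustrative assumptions, not OpenFactCheck's actual Python interface.

```python
# A minimal sketch of the three-module design described above: a
# response evaluator for a single LLM output, an LLM evaluator that
# aggregates over a prompt set, and a fact-checker evaluator that
# scores checkers against labeled claims. Names are illustrative.

def evaluate_response(response, fact_checker):
    """Module 1: claim-level factuality of one response."""
    claims = response.split(". ")                 # naive claim extraction
    verdicts = [fact_checker(c) for c in claims]  # True = supported
    return sum(verdicts) / len(verdicts)

def evaluate_llm(generate, prompts, fact_checker):
    """Module 2: average factuality of an LLM over a prompt set."""
    scores = [evaluate_response(generate(p), fact_checker) for p in prompts]
    return sum(scores) / len(scores)

def evaluate_fact_checker(fact_checker, labeled_claims):
    """Module 3: accuracy of a checker on gold-labeled claims."""
    hits = sum(fact_checker(claim) == label for claim, label in labeled_claims)
    return hits / len(labeled_claims)
```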

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

arXiv

The paper introduces Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. Ara-HOPE includes a five-category error taxonomy and a decision-tree annotation protocol designed to address the challenges of dialect-specific MT errors. Evaluation of Jais, GPT-3.5, and NLLB-200 shows that dialect-specific terminology and semantic preservation remain key challenges. Why it matters: The new framework and public dataset will help improve the evaluation and development of dialect-aware MT systems for Arabic.
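
To make the decision-tree protocol concrete, here is a hypothetical sketch of how an annotator's yes/no answers could be routed to an error category. Both the questions and the five category names are placeholders; the paper defines its own taxonomy.

```python
# A hypothetical sketch of a decision-tree annotation protocol for MT
# post-editing errors. Both the yes/no questions and the five category
# names are placeholders; the paper defines its own taxonomy.

def classify_edit(answers):
    """Route an annotator's yes/no answers to a single error category."""
    if answers["changes_meaning"]:
        return "semantic"             # source meaning lost or altered
    if answers["dialect_term_misrendered"]:
        return "dialect-terminology"  # dialect-specific word mistranslated
    if answers["ungrammatical_msa"]:
        return "grammar"              # output violates MSA grammar
    if answers["wrong_register"]:
        return "style"                # fluent but stylistically off
    return "minor/orthographic"       # spelling, punctuation, diacritics

# Example: an edit that fixed a mistranslated dialect-specific term.
print(classify_edit({"changes_meaning": False, "dialect_term_misrendered": True,
                     "ungrammatical_msa": False, "wrong_register": False}))
```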