GCC AI Research


Evaluation

7 articles

Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

arXiv · NLP Arabic AI

The paper introduces Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. Ara-HOPE pairs a five-category error taxonomy with a decision-tree annotation protocol designed to address the challenges of dialect-specific MT errors. Evaluation of Jais, GPT-3.5, and NLLB-200 shows that dialect-specific terminology and semantic preservation remain key challenges. Why it matters: The new framework and public dataset will help improve the evaluation and development of dialect-aware MT systems for Arabic.
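As a rough illustration of the decision-tree protocol idea, here is a minimal Python sketch; the category labels and the single stubbed branch are hypothetical placeholders, since the paper defines its own five categories and annotation questions:

```python
# Minimal sketch of a decision-tree style post-editing annotation pass.
# The category names below are illustrative, not the paper's taxonomy.
from dataclasses import dataclass

ERROR_CATEGORIES = [
    "dialect_terminology",   # hypothetical labels; the paper defines
    "semantic_preservation", # its own five-category taxonomy
    "grammar",
    "register",
    "orthography",
]

@dataclass
class Annotation:
    source_da: str   # dialectal Arabic input
    mt_output: str   # MSA hypothesis from the MT system
    post_edit: str   # human-corrected MSA reference
    category: str    # one label from ERROR_CATEGORIES, or "none"

def annotate(source_da: str, mt_output: str, post_edit: str) -> Annotation:
    """Walk a fixed sequence of yes/no questions, assigning the first
    category whose check fires -- the decision-tree idea in miniature."""
    if mt_output == post_edit:
        return Annotation(source_da, mt_output, post_edit, "none")
    # In a real protocol each branch is a question posed to the annotator;
    # here we stub only the first branch for illustration.
    return Annotation(source_da, mt_output, post_edit, ERROR_CATEGORIES[0])
```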

Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning

arXiv · NLP Arabic AI

This paper introduces a nested embedding learning framework for Arabic NLP, built on Matryoshka Embedding Learning and multilingual models. To enable comprehensive evaluation, the authors translated sentence-similarity datasets into Arabic. Experiments on the Arabic Natural Language Inference dataset show that Matryoshka embedding models outperform traditional models by 20-25% in capturing Arabic semantic nuances. Why it matters: This work advances Arabic NLP by providing a new method and evaluation benchmark for semantic similarity, which is crucial for tasks like information retrieval and text understanding.
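The Matryoshka idea itself is easy to show in isolation: one trained vector whose prefixes act as progressively smaller, still-usable embeddings. A minimal numpy sketch, using random stand-in vectors rather than a real Arabic encoder:

```python
# Matryoshka property in miniature: score a sentence pair at several
# nested prefix sizes of the same embedding vector.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nested_similarities(a: np.ndarray, b: np.ndarray,
                        dims=(64, 128, 256, 768)):
    """With a Matryoshka-trained encoder, small prefixes should track
    the full-dimension similarity score closely."""
    return {d: cosine(a[:d], b[:d]) for d in dims if d <= len(a)}

# Stand-in vectors; in practice these would come from a Matryoshka-trained
# multilingual encoder applied to two Arabic sentences.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=768), rng.normal(size=768)
print(nested_similarities(emb_a, emb_b))
```

The practical payoff is that one model serves every dimension budget: indexes can store 64-dimensional prefixes for fast retrieval and fall back to the full vector for re-ranking.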

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

arXiv · CV LLM

Researchers from MBZUAI have introduced the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) for assessing Video-LMMs. The benchmark evaluates models across 11 real-world video dimensions, revealing challenges in robustness and reasoning, particularly for open-source models. The authors also propose a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance Video-LMM performance, and make the dataset and code publicly available.
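A hedged sketch of what such a training-free, two-step prompting wrapper looks like; `video_lmm` is a placeholder interface and the prompt wording is illustrative, not the paper's actual DSCP prompts:

```python
# Two-step contextual prompting around a Video-LMM. Everything here is a
# sketch: the model interface and prompts are assumptions for illustration.
from typing import Callable

def dual_step_answer(video_lmm: Callable[[str, str], str],
                     video_path: str, question: str) -> str:
    # Step 1: elicit grounded context about the video before seeing the
    # question, to curb over-affirmative or hallucinated answers.
    step1 = ("Describe the key events, objects, and actions in this video. "
             "Note anything uncertain rather than guessing.")
    context = video_lmm(video_path, step1)

    # Step 2: answer the actual question conditioned on that context.
    step2 = (f"Context gathered from the video:\n{context}\n\n"
             f"Using only this context and the video, answer: {question}")
    return video_lmm(video_path, step2)
```

Because both steps are plain prompts, the technique needs no fine-tuning and can wrap any chat-style Video-LMM.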

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

arXiv · LLM Research

Researchers from the National Center for AI in Saudi Arabia investigated how sensitive Large Language Model (LLM) leaderboards are to minor benchmark perturbations. They found that small changes, such as reordering multiple-choice options, can shift model rankings by up to eight positions. The study recommends hybrid scoring, warns against over-reliance on simple benchmark evaluations, and provides code for further research.
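A toy version of one such perturbation, reordering answer choices and measuring the resulting accuracy swing, assuming a hypothetical `ask_model` function that returns the chosen letter:

```python
# Score the same MCQ items under every permutation of their options and
# report the accuracy spread attributable to ordering alone.
from itertools import permutations

LETTERS = "ABCD"

def accuracy_under_order(ask_model, items, order):
    """`order` is a tuple like (2, 0, 3, 1) mapping new slots to original
    option indices; the gold index is remapped accordingly."""
    correct = 0
    for question, options, gold_idx in items:
        shuffled = [options[i] for i in order]
        prompt = question + "\n" + "\n".join(
            f"{LETTERS[j]}. {opt}" for j, opt in enumerate(shuffled))
        pred = ask_model(prompt)                     # e.g. "B"
        if LETTERS.index(pred) == order.index(gold_idx):
            correct += 1
    return correct / len(items)

def order_sensitivity(ask_model, items):
    scores = [accuracy_under_order(ask_model, items, p)
              for p in permutations(range(4))]
    return max(scores) - min(scores)  # accuracy swing from ordering alone
```

Running this per model and re-ranking under each permutation is what exposes the leaderboard instability the paper reports.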

GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP

arXiv · NLP LLM

This paper presents a comprehensive evaluation of ChatGPT's performance across 44 Arabic NLP tasks using over 60 datasets, comparing its capabilities in Modern Standard Arabic (MSA) and Dialectal Arabic (DA) against smaller, fine-tuned models. Results show that ChatGPT trails the fine-tuned models and handles Arabic dialects less well than MSA. Why it matters: The work highlights the need for further research and development of Arabic-specific NLP models to overcome the limitations of general-purpose models like ChatGPT.
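The comparison pattern reduces to scoring the same test set for both systems, task by task. A minimal sketch with toy labels; the function name and example data are placeholders, and the paper's actual tasks, datasets, and metrics vary:

```python
# Score a zero-shot LLM and a fine-tuned baseline on one task's test set.
from sklearn.metrics import f1_score

def evaluate_task(gold, llm_preds, finetuned_preds):
    return {
        "chatgpt_f1": f1_score(gold, llm_preds, average="macro"),
        "finetuned_f1": f1_score(gold, finetuned_preds, average="macro"),
    }

# Toy sentiment example where the fine-tuned model wins.
gold = ["pos", "neg", "neg", "pos", "neu"]
print(evaluate_task(gold,
                    llm_preds=["pos", "pos", "neg", "pos", "neg"],
                    finetuned_preds=["pos", "neg", "neg", "pos", "neu"]))
```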

How computer vision model architecture and training affect performance

MBZUAI · CV Research

MBZUAI researchers found that ImageNet performance isn't always indicative of real-world task performance for computer vision models. The study analyzed four popular model configurations and found that their behavior on specific image types varies even when overall ImageNet accuracy is similar, indicating that certain configurations are better suited to particular tasks despite lower ImageNet scores. Why it matters: This challenges the reliance on ImageNet as a sole benchmark and highlights the need for task-specific evaluations in computer vision.
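The kind of breakdown that surfaces such differences is straightforward to compute; a small sketch with illustrative image-type tags (the study's actual categories and models are not reproduced here):

```python
# Per-image-type accuracy: two models with near-identical overall accuracy
# can diverge sharply on specific subsets, which is the study's point.
from collections import defaultdict

def per_type_accuracy(predictions, labels, image_types):
    """predictions/labels are class ids; image_types tags each example
    (e.g. 'blurry', 'low-light', 'texture-heavy')."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, gold, kind in zip(predictions, labels, image_types):
        totals[kind] += 1
        hits[kind] += int(pred == gold)
    return {kind: hits[kind] / totals[kind] for kind in totals}
```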

NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

arXiv · NLP Arabic AI

The third Nuanced Arabic Dialect Identification Shared Task (NADI 2022) focused on advancing Arabic NLP through country-level dialect identification and sentiment analysis of dialectal Arabic. A total of 21 teams participated; the winning systems achieved F1 scores of 27.06 on dialect identification and 75.16 on sentiment analysis. The task highlights the challenges of Arabic dialect processing and motivates further research. Why it matters: Standardized evaluations like NADI are crucial for benchmarking progress and fostering innovation in Arabic NLP, especially for dialectal variations.
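Shared tasks of this kind typically score country-level predictions with macro-averaged F1, which explains why a headline number like 27.06 can coexist with much higher per-country accuracy on frequent dialects. A minimal sketch with toy labels (the macro averaging here is an assumption; the summary does not state the exact metric definition):

```python
# Macro-F1 over country labels: every country counts equally, so rare
# dialects weigh as much as common ones.
from sklearn.metrics import f1_score

gold  = ["EG", "SA", "MA", "EG", "AE"]   # toy gold country labels
preds = ["EG", "EG", "MA", "EG", "SA"]   # toy system predictions

print(f"macro-F1: {100 * f1_score(gold, preds, average='macro'):.2f}")
```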