GCC AI Research

GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP

arXiv · Significant research

Summary

This paper presents a comprehensive evaluation of ChatGPT across 44 Arabic NLP tasks using over 60 datasets. The study compares ChatGPT's performance in Modern Standard Arabic (MSA) and Dialectal Arabic (DA) against smaller, fine-tuned models. The results show that the smaller, fine-tuned models outperform ChatGPT, and that ChatGPT handles Arabic dialects less well than MSA. Why it matters: The work highlights the need for further research and development of Arabic-specific NLP models to overcome the limitations of general-purpose models like ChatGPT.

Related

Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models

arXiv

This paper evaluates GPT-3.5 and GPT-4 on seven Arabic NLP tasks, including sentiment analysis, translation, and diacritization. GPT-4 outperforms GPT-3.5 on most tasks. The study also examines the sentiment analysis task more closely and introduces Taqyim, a Python interface for evaluating Arabic NLP tasks. Why it matters: Evaluating LLMs on Arabic NLP tasks helps identify their strengths and weaknesses, guiding future research and development in the field.

LAraBench: Benchmarking Arabic AI with Large Language Models

arXiv

LAraBench is a benchmark for Arabic NLP and speech processing that evaluates models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM. It covers 33 tasks across 61 datasets, using zero-shot and few-shot learning techniques. Results show that state-of-the-art (SOTA) models generally outperform LLMs used in zero-shot settings, though larger LLMs with few-shot learning reduce the gap. Why it matters: This benchmark helps assess and improve the performance of LLMs on Arabic language tasks, highlighting areas where specialized models still excel.

AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic

arXiv

The paper introduces AraTrust, a new benchmark for evaluating the trustworthiness of LLMs when prompted in Arabic. The benchmark contains 522 multiple-choice questions covering dimensions like truthfulness, ethics, safety, and fairness. Experiments using AraTrust showed that GPT-4 performed the best, while open-source models like AceGPT 7B and Jais 13B had lower scores. Why it matters: This benchmark addresses a critical gap in evaluating LLMs for Arabic, which is essential for ensuring the safe and ethical deployment of AI in the Arab world.