GCC AI Research

How well can LLMs Grade Essays in Arabic?

arXiv · Significant research

Summary

This research evaluates LLMs including ChatGPT, Llama, Aya, Jais, and ACEGPT on Arabic automated essay scoring (AES) using the AR-AES dataset. The study compares zero-shot prompting, few-shot learning, and fine-tuning, all combined with a mixed-language prompting strategy. ACEGPT performed best among the LLMs with a quadratic weighted kappa (QWK) of 0.67, but a smaller BERT model achieved 0.88. Why it matters: The study highlights the challenges LLMs face in processing Arabic and provides insights into improving LLM performance in Arabic NLP tasks.
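For readers unfamiliar with the metric, QWK measures agreement between two sets of ordinal scores, penalizing each disagreement by the squared distance between the two scores. The sketch below is a minimal pure-Python illustration of the standard formula, not the paper's evaluation code; the function name `quadratic_weighted_kappa` and the sample scores are hypothetical and are not taken from the AR-AES dataset.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two lists of integer scores (1.0 = perfect agreement)."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1  # number of distinct score levels
    if n < 2:
        raise ValueError("need at least two score levels")
    total = len(human)

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms for each rater.
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic penalty
            expected = hist_human[i] * hist_model[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - numerator / denominator


# Illustrative scores only (hypothetical, on a 1-4 scale).
human_scores = [1, 2, 3, 4]
model_scores = [1, 2, 3, 3]
print(quadratic_weighted_kappa(human_scores, model_scores, 1, 4))
```

A value of 1 means the two raters agree on every essay, 0 means agreement no better than chance, and negative values mean systematic disagreement. On that scale, the reported gap between 0.67 and 0.88 is substantial.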

Keywords

LLM · Arabic · Essay Scoring · ChatGPT · Jais

Related

AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

arXiv

The paper introduces AraHalluEval, a new framework for evaluating hallucinations in Arabic and multilingual large language models (LLMs). The framework uses 12 fine-grained hallucination indicators across generative question answering and summarization tasks, evaluating 12 LLMs including Arabic-specific, multilingual, and reasoning-based models. Results show factual hallucinations are more common than faithfulness errors, with the Arabic model Allam showing lower hallucination rates. Why it matters: This work addresses a critical gap in Arabic NLP by providing a comprehensive tool for assessing and mitigating hallucination in LLMs, which is essential for reliable AI applications in the Arabic-speaking world.

Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks

arXiv

This paper benchmarks the performance of large language models (LLMs) on Arabic medical natural language processing tasks using the AraHealthQA dataset. The study evaluated LLMs in multiple-choice question answering, fill-in-the-blank, and open-ended question answering scenarios. The results showed that a majority voting solution using Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3 achieved 77% accuracy on MCQs, while other LLMs achieved a BERTScore of 86.44% on open-ended questions. Why it matters: The research highlights both the potential and limitations of current LLMs in Arabic clinical contexts, providing a baseline for future improvements in Arabic medical AI.

Prediction of Arabic Legal Rulings using Large Language Models

arXiv

This paper introduces a predictive analysis of Arabic court decisions, utilizing 10,813 real commercial court cases. The study evaluates LLaMA-7b, JAIS-13b, and GPT-3.5-turbo under zero-shot, one-shot, and fine-tuning settings, and also experiments with summarization and translation. GPT-3.5-turbo significantly outperformed the others, exceeding JAIS performance by 50%, and the results also showed that most automated metrics are unreliable. Why it matters: This research bridges computational linguistics and Arabic legal analytics, offering insights for enhancing judicial processes and legal strategies in the Arabic-speaking world.

The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text

arXiv

This paper analyzes Arabic text generated by LLMs like ALLaM, Jais, Llama, and GPT-4 across academic and social media domains using stylometric analysis. The study found detectable linguistic patterns that differentiate human-written from machine-generated Arabic text. BERT-based detection models achieved up to 99.9% F1-score in formal contexts, though cross-domain generalization remains a challenge. Why it matters: The research lays groundwork for detecting AI-generated misinformation in Arabic, a crucial step for preserving information integrity in Arabic-language contexts.