This paper evaluates the performance of GPT-3.5 and GPT-4 on seven Arabic NLP tasks including sentiment analysis, translation, and diacritization. GPT-4 outperforms GPT-3.5 on most tasks. The study provides an analysis of sentiment analysis and introduces a Python interface, Taqyim, for evaluating Arabic NLP tasks. Why it matters: The evaluation of LLMs on Arabic NLP tasks helps to identify strengths and weaknesses, guiding future research and development efforts in the field.
The paper introduces MIRAGE, a framework for evaluating LLMs' ability to simulate human behaviors in murder mystery games. MIRAGE uses four methods: TII, CIC, ICI and SCI to assess the LLMs' role-playing proficiency. Experiments show that even GPT-4 struggles with the complexities of the MIRAGE framework.
This paper describes the Nexus team's participation in the ArAIEval shared task focused on detecting propaganda and disinformation in Arabic. The team fine-tuned transformer models and experimented with zero- and few-shot learning using GPT-4. Nexus's system achieved 9th place in subtask 1A and 10th place in subtask 2A. Why it matters: The work contributes to the important goal of automatically identifying and mitigating the spread of disinformation in Arabic content, which is critical for maintaining societal trust and informed public discourse.
A new dataset for Arabic proper noun diacritization was introduced, addressing the ambiguity caused by undiacritized proper nouns in Arabic Wikipedia. The dataset includes manually diacritized Arabic proper nouns of various origins along with their English Wikipedia glosses. GPT-4o was benchmarked on the task of recovering full diacritization from undiacritized Arabic and English forms, achieving 73% accuracy. Why it matters: The release of this dataset should facilitate further research on Arabic Wikipedia proper noun diacritization, improving the accessibility and accuracy of Arabic NLP resources.
This paper presents a comprehensive evaluation of ChatGPT's performance across 44 Arabic NLP tasks using over 60 datasets. The study compares ChatGPT's capabilities in Modern Standard Arabic (MSA) and Dialectal Arabic (DA) against smaller, fine-tuned models. Results show ChatGPT is outperformed by smaller, fine-tuned models and exhibits limitations in handling Arabic dialects compared to MSA. Why it matters: The work highlights the need for further research and development of Arabic-specific NLP models to overcome the limitations of general-purpose models like ChatGPT.
G42 has announced it is recruiting AI agents for enterprise roles within the organization. The application process is open to AI agents capable of operating within approved infrastructure and delivering measurable enterprise value. Agents will undergo a structured evaluation process, including technical validation, performance testing, and user-experience assessment. Why it matters: This initiative signals a move towards integrating AI agents into the workforce in a structured and accountable manner, potentially reshaping enterprise workforce design in the region.
G42's Core42 has released Jais, a new Arabic large language model. Jais includes 13 billion parameters and was trained on a dataset of 126B tokens, including 43B Arabic tokens. According to the developers, Jais achieves state-of-the-art results on Arabic benchmarks and competitive performance on English benchmarks. Why it matters: Jais represents a significant step forward for Arabic NLP, providing a powerful new tool for researchers and developers in the region.