This paper evaluates the performance of GPT-3.5 and GPT-4 on seven Arabic NLP tasks, including sentiment analysis, translation, and diacritization. GPT-4 outperforms GPT-3.5 on most tasks. The study provides an in-depth analysis of the sentiment analysis task and introduces a Python interface, Taqyim, for evaluating Arabic NLP tasks. Why it matters: Evaluating LLMs on Arabic NLP tasks helps identify their strengths and weaknesses, guiding future research and development efforts in the field.
MBZUAI's Qirong Ho and colleagues are developing an Artificial Intelligence Operating System (AIOS) for decarbonization, aiming to reduce energy waste in AI development. The AIOS focuses on improving communication efficiency between machines during AI model training, as inefficient communication leads to prolonged tasks and increased energy consumption. This system addresses the high computing power demands of large language models like ChatGPT and LLaMA-2. Why it matters: By optimizing energy usage in AI development, the AIOS could significantly reduce the carbon footprint of AI technologies in the region and globally.
This paper presents a comprehensive evaluation of ChatGPT's performance across 44 Arabic NLP tasks using over 60 datasets. The study compares ChatGPT's capabilities in Modern Standard Arabic (MSA) and Dialectal Arabic (DA) against smaller, fine-tuned models. Results show ChatGPT is outperformed by smaller, fine-tuned models and exhibits limitations in handling Arabic dialects compared to MSA. Why it matters: The work highlights the need for further research and development of Arabic-specific NLP models to overcome the limitations of general-purpose models like ChatGPT.
This research explores the use of generative AI, specifically ChatGPT, to create student assessments that align with academic accreditation standards, such as those of the National Center for Academic Accreditation in Saudi Arabia and ABET. The study introduces a method for mapping verbs used in questions to educational outcomes, enabling AI to produce and validate accreditation-compliant questions. A survey of faculty members in Saudi universities showed high acceptance rates for AI-generated exam questions and AI assistance in editing existing questions.
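The verb-to-outcome mapping described in this item can be illustrated with a minimal sketch. The verb table, the outcome labels, and the function names below are hypothetical and chosen only for illustration; they are not taken from the study or from any accreditation body's documents.

```python
import re

# Hypothetical verb-to-outcome table. The categories are illustrative
# and are NOT the mapping used in the study.
VERB_TO_OUTCOME = {
    "define": "knowledge",
    "list": "knowledge",
    "explain": "understanding",
    "calculate": "application",
    "compare": "analysis",
    "design": "synthesis",
    "evaluate": "evaluation",
}


def outcomes_for_question(question):
    """Return the outcome labels triggered by verbs in the question,
    in order of first appearance and without duplicates."""
    words = re.findall(r"[a-z]+", question.lower())
    found = []
    for word in words:
        outcome = VERB_TO_OUTCOME.get(word)
        if outcome is not None and outcome not in found:
            found.append(outcome)
    return found


def is_aligned(question, target_outcome):
    """Check whether a question contains a verb mapped to the target outcome."""
    return target_outcome in outcomes_for_question(question)
```

Under these assumptions, a generated question could be accepted only when `is_aligned` returns `True` for the outcome the course specification requires, which is one simple way to validate questions after generation.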
A new dataset for Arabic proper noun diacritization was introduced, addressing the ambiguity caused by undiacritized proper nouns in Arabic Wikipedia. The dataset includes manually diacritized Arabic proper nouns of various origins along with their English Wikipedia glosses. GPT-4o was benchmarked on the task of recovering full diacritization from undiacritized Arabic and English forms, achieving 73% accuracy. Why it matters: The release of this dataset should facilitate further research on Arabic Wikipedia proper noun diacritization, improving the accessibility and accuracy of Arabic NLP resources.
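To make the task in this item concrete, the sketch below shows how an undiacritized form can be derived from a diacritized name and how predictions might be scored. It assumes accuracy is measured as exact string match, which the summary does not specify; the function names are hypothetical and this is not the authors' evaluation code.

```python
import re

# Arabic diacritics (tashkeel): fathatan through sukun (U+064B to U+0652),
# plus the superscript (dagger) alif (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving the bare letters."""
    return _DIACRITICS.sub("", text)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to the reference diacritization.

    Assumption: a prediction counts as correct only when it matches
    the reference string exactly.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

A model under test would receive `strip_diacritics(name)` together with the English gloss and return a fully diacritized string, which is then compared against the manually diacritized reference.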
InfiAgent is a new agent framework comparable to GPT4-Agent, developed by replicating Codex. It includes InfiCoder, an open-source model for text-to-code, code-to-code, and freeform code-related QA tasks. The framework focuses on data analysis and integrates an LLM with programming capabilities and a sandbox environment for executing Python code. Why it matters: This research demonstrates the potential for advancements in AI operating systems and highlights areas where current models like GPT-4V can be improved, contributing to the broader development of more capable and versatile AI agents.
A new paper from MBZUAI researchers explores using ChatGPT to combat the spread of fake news. The researchers, including Preslav Nakov and Liangming Pan, demonstrate that ChatGPT can be used to fact-check published information. Their paper, "Fact-Checking Complex Claims with Program-Guided Reasoning," was accepted at ACL 2023. Why it matters: This research highlights the potential of large language models to address the growing challenge of misinformation, with implications for maintaining information integrity in the digital age.
Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.