GCC AI Research

20 million words and counting: UAE’s grand plan to power Arabic with AI - Gulf Business

WAM · Significant research

Summary

The UAE government is developing large language models (LLMs) specifically for the Arabic language, with a target training dataset of 20 million words. This initiative aims to overcome the underrepresentation of Arabic in existing AI models. The project seeks to enhance AI's ability to understand and generate nuanced Arabic content. Why it matters: A national Arabic LLM can enable culturally relevant AI applications across various sectors in the region, from education to government services.

Keywords

LLM · Arabic · UAE · NLP · AI

Related

AI in Arabic? How Gulf could soon lead Artificial Intelligence race - Khaleej Times

Khaleej Times

The Gulf region is making significant investments in artificial intelligence, particularly in Arabic NLP. Recent developments include large language models trained on Arabic data and initiatives to promote AI ethics and policy. Why it matters: These investments aim to position the Gulf as a leader in AI, especially in leveraging the Arabic language and culture.

AI and the Arabic language: Preserving cultural heritage and enabling future discovery

MBZUAI

This article discusses MBZUAI's efforts to advance Arabic language AI, including the development of deep learning-based language models. Key initiatives include Jais, a 13B-parameter Arabic LLM developed in collaboration with G42's Inception, and Atlas-Chat, which understands the Moroccan dialect. The university is also incorporating Arabic into practical AI solutions such as BiMediX2, a multimodal healthcare model that understands medical queries in both English and Arabic. Why it matters: These initiatives are crucial for preserving Arabic cultural heritage, enabling future discovery, and addressing linguistic challenges specific to the Arabic language in AI applications.

101 Billion Arabic Words Dataset

arXiv

Researchers compiled the 101 Billion Arabic Words Dataset by mining text from Common Crawl WET files and rigorously cleaning and deduplicating the extracted content. The dataset aims to address the scarcity of original, high-quality Arabic linguistic data, which often leads to bias in Arabic LLMs that rely on translated English data. The authors describe it as the largest Arabic dataset available to date. Why it matters: The new dataset can significantly contribute to the development of authentic Arabic LLMs that are more linguistically and culturally accurate.

Natural language processing is at the top of MBZUAI’s agenda

MBZUAI

MBZUAI is prioritizing natural language processing (NLP) research, aiming to be a top university in the field within 12-18 months, according to Professor Timothy Baldwin. MBZUAI's NLP department is focusing on deep learning, algorithmic fairness, computational social science, and social media analytics. A key area is Arabic NLP, which addresses the challenges of dialectal variation and code-switching in social media. Why it matters: This focus on Arabic NLP and real-world problem-solving will contribute to the UAE's ambitious agenda of growing a local AI industry and integrating AI into various sectors.