This paper introduces AraLLaMA, a new Arabic large language model (LLM) trained using a progressive vocabulary expansion method inspired by second language acquisition. The model utilizes a modified byte-pair encoding (BPE) algorithm to dynamically extend the Arabic subwords in its vocabulary during training, balancing the out-of-vocabulary (OOV) ratio. Experiments show AraLLaMA achieves performance comparable to existing Arabic LLMs on various benchmarks, and all models, data, and code will be open-sourced. Why it matters: This work addresses the need for more accessible and performant Arabic LLMs, contributing to democratization of AI in the Arab world.
This article discusses retrieval augmentation in text generation, where information retrieved from an external source is used to condition predictions. It references recent work on retrieval-augmented image captioning, showing that model size can be greatly reduced when training data is available through retrieval. The author intends to continue this work focusing on the intersection of retrieval augmentation and in-context learning, and controllable image captioning for language learning materials. Why it matters: This research direction has the potential to improve transfer learning in vision-language models, which could be especially relevant for downstream applications in Arabic NLP and multimodal tasks.
The UAE government is developing large language models (LLMs) specifically for the Arabic language, with a target training dataset of 20 million words. This initiative aims to overcome the underrepresentation of Arabic in existing AI models. The project seeks to enhance AI's ability to understand and generate nuanced Arabic content. Why it matters: A national Arabic LLM can enable culturally relevant AI applications across various sectors in the region, from education to government services.
Cerebras Systems is planning a significant expansion in the UAE, aiming to deploy more of its AI supercomputers in the region. The company is partnering with Abu Dhabi's G42 to provide the computing power necessary for G42's AI models. Cerebras already has three of its systems deployed in the UAE and anticipates further installations. Why it matters: This expansion highlights the UAE's growing importance as a hub for AI development and the increasing demand for high-performance computing in the region.