Skip to content
GCC AI Research

Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

arXiv · · Significant research

Summary

This paper introduces AraLLaMA, a new Arabic large language model (LLM) trained using a progressive vocabulary expansion method inspired by second language acquisition. The model utilizes a modified byte-pair encoding (BPE) algorithm to dynamically extend the Arabic subwords in its vocabulary during training, balancing the out-of-vocabulary (OOV) ratio. Experiments show AraLLaMA achieves performance comparable to existing Arabic LLMs on various benchmarks, and all models, data, and code will be open-sourced. Why it matters: This work addresses the need for more accessible and performant Arabic LLMs, contributing to democratization of AI in the Arab world.

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

ALLaM: Large Language Models for Arabic and English

arXiv ·

The paper introduces ALLaM, a series of large language models for Arabic and English, designed to support Arabic Language Technologies. The models are trained with language alignment and knowledge transfer in mind, using a decoder-only architecture. ALLaM achieves state-of-the-art results on Arabic benchmarks like MMLU Arabic and Arabic Exams. Why it matters: This work advances Arabic NLP by providing high-performing LLMs and demonstrating effective techniques for cross-lingual transfer learning and alignment with human preferences.

From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

arXiv ·

This paper explores cross-lingual transfer in Arabic language models, which are typically pretrained on Modern Standard Arabic (MSA) but expected to generalize to diverse dialects. The study uses probing on 3 NLP tasks and representational similarity analysis to assess transfer effectiveness. Results show transfer is uneven across dialects, partially linked to geographic proximity, and models trained on all dialects exhibit negative interference. Why it matters: The findings highlight challenges in cross-lingual transfer for Arabic NLP and raise questions about dialect similarity for model training.

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

arXiv ·

The Hala technical report introduces a family of Arabic-centric instruction and translation models developed using a translate-and-tune pipeline. A strong Arabic-English teacher model is compressed to FP8 and used to create bilingual supervision data. The LFM2-1.2B model is fine-tuned on this data and used to translate English instruction sets into Arabic, creating a million-scale corpus. Why it matters: The release of models, data, evaluation tools, and recipes will accelerate research and development in Arabic NLP, providing valuable resources for the community.

Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

arXiv ·

The paper introduces Arabic Stable LM, a 1.6B parameter Arabic-centric language model, in both base and chat versions. The Arabic Stable LM 1.6B chat model achieves strong results on several benchmarks, outperforming models with up to 8x more parameters. The study also demonstrates the benefit of incorporating synthetic instruction tuning data through a large synthetic dialogue dataset. Why it matters: This work makes Arabic LLMs more accessible by reducing the parameter size while maintaining strong performance, facilitating deployment in resource-constrained environments.