The paper introduces ALLaM, a series of large language models for Arabic and English, designed to support Arabic Language Technologies. The models are trained with language alignment and knowledge transfer in mind, using a decoder-only architecture. ALLaM achieves state-of-the-art results on Arabic benchmarks like MMLU Arabic and Arabic Exams. Why it matters: This work advances Arabic NLP by providing high-performing LLMs and demonstrating effective techniques for cross-lingual transfer learning and alignment with human preferences.
The paper introduces Arabic Stable LM, a 1.6B parameter Arabic-centric language model, in both base and chat versions. The Arabic Stable LM 1.6B chat model achieves strong results on several benchmarks, outperforming models with up to 8x more parameters. The study also demonstrates the benefit of incorporating synthetic instruction tuning data through a large synthetic dialogue dataset. Why it matters: This work makes Arabic LLMs more accessible by reducing the parameter count while maintaining strong performance, facilitating deployment in resource-constrained environments.
Injy Hamed from NYU Abu Dhabi's CAMeL Lab presented work on Egyptian Arabic-English code-switching for ASR and MT. She discussed the ArzEn-ST speech translation corpus and compared end-to-end and hybrid systems for ASR. For MT, she presented data augmentation and word segmentation techniques to handle data scarcity. She also addressed the challenges of evaluating ASR output on code-switched speech. Why it matters: Research into code-switching is crucial for building NLP systems capable of processing real-world language use in the Arab world.
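To illustrate why ASR evaluation is tricky for code-switched speech, here is a minimal sketch of word error rate (WER), the standard ASR metric, computed with a word-level edit distance. The function name and the example sentences are hypothetical and are not taken from the talk or from ArzEn-ST. The example shows one commonly cited difficulty: an English word transcribed in Arabic script is counted as an error even though it is the same spoken word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance: "meeting" is written in Latin
# script in the reference but in Arabic script in the hypothesis.
reference = "ana rayeh el meeting bokra"
hypothesis = "ana rayeh el ميتنج bokra"
wer = word_error_rate(reference, hypothesis)  # one substitution out of five words
```

Plain WER penalizes the script mismatch as a full substitution, which is one reason script-aware or transliteration-aware scoring is sometimes considered for code-switched data.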
The paper introduces Aladdin-FTI, a system designed for generating and translating dialectal Arabic (DA). Aladdin-FTI supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects. It also handles bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. Why it matters: This work contributes to addressing the under-representation of Arabic dialects in NLP research and enables more inclusive Arabic language models.
The Hala technical report introduces a family of Arabic-centric instruction and translation models developed using a translate-and-tune pipeline. A strong Arabic-English teacher model is compressed to FP8 and used to create bilingual supervision data. The LFM2-1.2B model is fine-tuned on this data and used to translate English instruction sets into Arabic, creating a million-scale corpus. Why it matters: The release of models, data, evaluation tools, and recipes will accelerate research and development in Arabic NLP, providing valuable resources for the community.
This paper describes QCRI's machine translation systems for the IWSLT 2016 evaluation campaign, focusing on the Arabic-English and English-Arabic tracks. The team built both phrase-based and neural machine translation models. A neural MT system, trained by stacking data from different genres through fine-tuning and applying an ensemble of 8 models, outperformed a strong phrase-based system by 2 BLEU points in the Arabic->English direction. Why it matters: The research highlights the early promise of neural machine translation for Arabic language pairs, demonstrating its potential to surpass traditional methods.
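The summary above does not say how the 8 models were combined. A common approach to ensembling in neural MT is to average the models' next-token probability distributions at each decoding step, and the sketch below assumes that approach. The function names and the toy three-word vocabulary are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def ensemble_average(distributions: Sequence[Sequence[float]]) -> List[float]:
    """Average next-token probability distributions from several models."""
    if not distributions:
        raise ValueError("need at least one distribution")
    vocab_size = len(distributions[0])
    if any(len(d) != vocab_size for d in distributions):
        raise ValueError("all distributions must share one vocabulary")
    n_models = len(distributions)
    return [
        sum(d[token] for d in distributions) / n_models
        for token in range(vocab_size)
    ]


def pick_token(distributions: Sequence[Sequence[float]]) -> int:
    """Greedy choice: index of the token with the highest averaged probability."""
    averaged = ensemble_average(distributions)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Toy decoding step: three models score a three-token vocabulary.
step_distributions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
chosen = pick_token(step_distributions)
```

In this toy step the first model alone would pick token 0, but the averaged distribution favors token 1. A real system would typically apply the same averaging inside beam search rather than greedy decoding.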
This survey paper reviews the landscape of Natural Language Processing (NLP) research and applications in the Arab world. It discusses the unique challenges posed by the Arabic language, such as its morphological complexity and dialectal diversity. The paper also presents a historical overview of Arabic NLP and surveys various research areas, including machine translation, sentiment analysis, and speech recognition. Why it matters: The survey provides a comprehensive resource for researchers and practitioners interested in the current state and future directions of Arabic NLP, a field critical for enabling AI technologies to serve Arabic-speaking communities.