This paper studies the impact of data scale on Arabic Pretrained Language Models (PLMs). Researchers retrained BERT-base and T5-base models on large Arabic corpora, achieving state-of-the-art results on the ALUE and ORCA benchmarks. The analysis indicates that pretraining data volume is the most important factor for performance. Why it matters: This work provides valuable insights into building effective Arabic language models, emphasizing the importance of large, high-quality datasets for advancing Arabic NLP.
This paper introduces two methods for creating Arabic LLM prompts at scale: translating existing English prompt datasets and creating natural language prompts from Arabic NLP datasets. Using these methods, the authors generated over 67.4 million Arabic prompts covering tasks like summarization and question answering. Fine-tuning a 7B Qwen2 model on these prompts outperforms a 70B Llama3 model in handling Arabic prompts. Why it matters: The research provides a cost-effective approach to scaling Arabic LLM training data, potentially improving the performance of smaller, more accessible models for Arabic NLP.
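To illustrate the second method (turning labeled Arabic NLP datasets into natural language prompts), here is a minimal sketch. The template wording, the record layout (`inputs`, `target`), and the `build_prompt` helper are all hypothetical and are not taken from the paper, which this summary does not quote.

```python
from typing import Dict

# Hypothetical Arabic instruction templates, one per task.
# The paper's actual templates are not given in this summary.
TEMPLATES: Dict[str, str] = {
    # "Summarize the following text:"
    "summarization": "لخّص النص التالي:\n{text}",
    # "Read the text, then answer the question."
    "question_answering": (
        "اقرأ النص ثم أجب عن السؤال.\n"
        "النص: {context}\n"
        "السؤال: {question}"
    ),
}


def build_prompt(task: str, record: dict) -> Dict[str, str]:
    """Convert one labeled dataset record into a prompt/response pair.

    `record` is assumed to look like {"inputs": {...}, "target": "..."},
    where the keys of `inputs` match the placeholders in the task template.
    """
    template = TEMPLATES[task]
    return {
        "prompt": template.format(**record["inputs"]),
        "response": record["target"],
    }


example = {
    "inputs": {"context": "القاهرة عاصمة مصر.", "question": "ما عاصمة مصر؟"},
    "target": "القاهرة",
}
pair = build_prompt("question_answering", example)
```

Applying a function like this across every record of an existing dataset is one plausible way to reach prompt counts in the millions without manual authoring.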
This study reviews the use of large language models (LLMs) for Arabic language processing, focusing on pre-trained models and their applications. It highlights the challenges in Arabic NLP due to the language's complexity and the relative scarcity of resources. The review also discusses how techniques like fine-tuning and prompt engineering enhance model performance on Arabic benchmarks. Why it matters: This overview helps consolidate research directions and benchmarks in Arabic NLP, guiding future development of LLMs tailored for the Arabic language and its diverse dialects.
Arabic Language Models (LMs) are primarily pretrained on Modern Standard Arabic (MSA), with an expectation of transferring to diverse Arabic dialects for real-world applications. This work explores cross-lingual transfer in Arabic LMs using probing on three Natural Language Processing (NLP) tasks and representational similarity. The findings indicate that transfer is possible but disproportionate across dialects, with some evidence of negative interference in models trained to support all Arabic dialects. Why it matters: This research highlights crucial challenges for building robust Arabic AI systems that effectively handle the significant linguistic diversity of the Arab world.
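The summary does not say which representational similarity measure the authors used. As one common choice, the sketch below computes linear Centered Kernel Alignment (CKA) between two sets of representations, for example a model's hidden states for the same sentences rendered in MSA and in a dialect. The use of CKA here is an assumption for illustration, not the paper's stated method.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = examples, columns = hidden dimensions


def center_columns(m: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).

    x and y must have the same number of rows (the same examples in the
    same order); their column counts may differ.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(yc, xc)
    denominator = math.sqrt(
        cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc)
    )
    return numerator / denominator


# Toy stand-ins for hidden states of three parallel sentences.
msa_reps = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
dialect_reps = [[2.0, 4.0], [6.0, 8.0], [10.0, 14.0]]  # msa_reps scaled by 2
score = linear_cka(msa_reps, dialect_reps)
```

Linear CKA is invariant to isotropic scaling and orthogonal rotation of either representation, so the toy pair above scores as maximally similar. For real hidden states, an array library would be the practical choice; plain lists are used here only to keep the sketch dependency-free.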