The paper introduces InstAr-500k, a new Arabic instruction dataset of 500,000 examples designed to improve LLM performance in Arabic. Researchers fine-tuned the open-source Gemma-7B model using InstAr-500k and evaluated it on downstream tasks, achieving strong results on Arabic NLP benchmarks. They then released GemmAr-7B-V1, a model specifically tuned for Arabic NLP tasks. Why it matters: This work addresses the lack of high-quality Arabic instruction data, potentially boosting the capabilities of Arabic language models.
The article discusses parameter-efficient fine-tuning methods for large NLP models, highlighting their importance due to the increasing size and computational demands of state-of-the-art language models. It provides an overview of these methods, presenting them in a unified view to emphasize their similarities and differences. Indraneil, a PhD candidate at TU Darmstadt's UKP Lab, is researching parameter-efficient fine-tuning, sparsity, and conditional computation methods to improve LLM performance in multilingual, multi-task settings. Why it matters: Efficient fine-tuning techniques are crucial for democratizing access to and accelerating the deployment of large language models in the region and beyond.
The Hala technical report introduces a family of Arabic-centric instruction and translation models developed using a translate-and-tune pipeline. A strong Arabic-English teacher model is compressed to FP8 and used to create bilingual supervision data. The LFM2-1.2B model is fine-tuned on this data and used to translate English instruction sets into Arabic, creating a million-scale corpus. Why it matters: The release of models, data, evaluation tools, and recipes will accelerate research and development in Arabic NLP, providing valuable resources for the community.
MBZUAI has released Bactrian-X, a multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Its researchers used the dataset to train low-rank adaptation (LoRA) adapters, which are lightweight, replaceable components for large language models. Experiments show that the LoRA-based models outperform both vanilla models and existing instruction-tuned models in multilingual settings.
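As a rough illustration of why LoRA adapters are lightweight, the following minimal sketch in plain Python compares the trainable parameter count of one full weight matrix with that of a low-rank adapter, and shows the adapter's forward pass. The matrix size (4096 x 4096), the rank (8), and all function names are illustrative assumptions, not values or code from the Bactrian-X work.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in              # fine-tuning the whole matrix W
    lora = rank * (d_out + d_in)     # training only B (d_out x r) and A (r x d_in)
    return full, lora


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical sizes: a 4096 x 4096 projection with a rank-8 adapter.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora)  # 16777216 65536
```

Because the base weights `W` are never modified, an adapter pair `(A, B)` can be swapped out without touching the underlying model, which is what makes such components replaceable.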
MBZUAI researchers created Bactrian-X, a new dataset to improve LLM instruction following in low-resource languages. The dataset is built for instruction tuning: it pairs instructions in many languages with the expected responses. Bactrian-X builds upon existing open-source instruction-tuned models. Why it matters: This work aims to democratize access to LLMs by enabling users to interact with them in their native languages, even when their English proficiency is limited.
This paper introduces Saudi-Dialect-ALLaM, a LoRA fine-tuned version of the Saudi Arabian foundation model ALLaM-7B-Instruct-preview, designed to improve the generation of Saudi dialects (Najdi and Hijazi). The model is trained on a private dataset of 5,466 synthetic instruction-response pairs, with two variants explored: Dialect-Token and No-Token training. Results indicate that the Dialect-Token model achieves superior dialect control and fidelity compared to generic instruction models, although the dataset and model weights are not released.
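The difference between the two training variants can be sketched as a prompt-formatting step. This is a hedged illustration only: the tag strings, the dictionary, and the `build_prompt` helper are hypothetical, since the summary does not state the paper's actual token format.

```python
# Hypothetical dialect tags; the paper's real token strings are not given here.
DIALECT_TAGS = {"najdi": "<NAJDI>", "hijazi": "<HIJAZI>"}


def build_prompt(instruction, dialect, use_token=True):
    """Format one training instruction for Dialect-Token or No-Token training."""
    if dialect not in DIALECT_TAGS:
        raise ValueError(f"unknown dialect: {dialect}")
    if use_token:
        # Dialect-Token variant: prepend an explicit tag naming the target dialect.
        return f"{DIALECT_TAGS[dialect]} {instruction}"
    # No-Token variant: the instruction is left unchanged.
    return instruction


print(build_prompt("Write a greeting", "najdi"))
print(build_prompt("Write a greeting", "hijazi", use_token=False))
```

Under this assumption, an explicit tag gives the model a direct signal for which dialect to produce at inference time, which is one plausible reason the Dialect-Token variant shows better dialect control.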