MBZUAI releases Bactrian-X, a multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Using this dataset, the team trained low-rank adaptation (LoRA) adapters, lightweight and swappable components that sit on top of large language models. Experiments show the LoRA-based models outperform both vanilla base models and existing instruction-tuned models in multilingual settings.
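Because the adapters are distributed separately from the base weights, attaching one takes only a few lines with the PEFT library. Below is a minimal sketch of that workflow; the base-model ID, adapter repository name, and prompt template are illustrative assumptions rather than confirmed release details.

```python
# Minimal sketch: attaching a Bactrian-X-style LoRA adapter to a base model.
# The model and adapter IDs below are assumed placeholders, not verified names.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-7b"                  # assumed base model
adapter_id = "MBZUAI/bactrian-x-llama-7b-lora"   # assumed adapter repo name

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

# The LoRA weights are a small, swappable add-on: loading them leaves the base
# model untouched, so adapters for different languages can be exchanged freely.
model = PeftModel.from_pretrained(base_model, adapter_id)

# Assumed Alpaca-style instruction prompt for illustration.
prompt = "### Instruction:\nTranslate 'good morning' to Indonesian.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```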
MBZUAI researchers created Bactrian-X, a new dataset to improve LLM instruction following in low-resource languages. The dataset is built for instruction tuning, pairing instructions in dozens of languages with expected responses, and it builds on existing open-source instruction-tuning resources. Why it matters: This work aims to democratize access to LLMs by enabling users to interact with them in their native languages, even when English proficiency is limited.
Researchers fine-tuned the Qwen2-1.5B model for Arabic with QLoRA on a system with only 4GB of VRAM, drawing on datasets such as Bactrian and Arabic Wikipedia. They addressed challenges specific to Arabic NLP, including rich morphology and dialectal variation. After 10,000 training steps, the loss converged to 0.1083, with improved handling of Arabic-specific linguistic phenomena. Why it matters: This demonstrates a resource-efficient approach for creating specialized Arabic language models, democratizing access to advanced NLP technologies; a minimal version of such a setup is sketched below.
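The sketch below shows how a low-memory fine-tune along these lines could be set up with Hugging Face transformers, peft, and bitsandbytes: the base model is loaded in 4-bit NF4 precision and only a small LoRA adapter is trained. The hyperparameters and target modules are illustrative assumptions, not the authors' exact recipe.

```python
# QLoRA sketch in the spirit of the setup described above: Qwen2-1.5B loaded
# in 4-bit with a small trainable LoRA adapter on top. Assumed hyperparameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen2-1.5B"

# 4-bit NF4 quantization keeps the frozen base weights small enough for ~4GB VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only the low-rank adapter matrices are trained; the quantized base stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From here, training would proceed with a standard causal-LM training loop (e.g. the transformers Trainer) over the tokenized Arabic instruction data; only the adapter weights accumulate gradients.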
The authors introduce Nile-Chat, a collection of LLMs (4B, 3x4B-A6B, and 12B) built specifically for the Egyptian dialect and capable of understanding and generating text in both Arabic and Latin scripts. A novel language adaptation approach based on the Branch-Train-MiX strategy merges script-specialized experts into a single mixture-of-experts (MoE) model. Nile-Chat models outperform multilingual and Arabic LLMs such as LLaMa, Jais, and ALLaM on newly introduced Egyptian benchmarks, with the 12B model achieving a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks; all resources are publicly available. Why it matters: This work addresses the overlooked problem of adapting LLMs to dual-script languages, providing a methodology for creating more inclusive and representative language models for the Arabic-speaking world.
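To make the Branch-Train-MiX idea concrete, here is a toy PyTorch sketch in which two feed-forward experts, imagined as Arabic-script and Latin-script specialists, sit behind a learned router inside one MoE layer. This is an illustrative simplification of the general technique, not Nile-Chat's actual architecture or code.

```python
# Toy illustration of merging specialized experts behind a router (MoE layer).
# In Branch-Train-MiX, each expert FFN would be initialized from a separately
# trained branch (e.g. one per script) rather than from random weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoExpertMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2)  # e.g. expert 0: Arabic script, expert 1: Latin script
        ])
        self.router = nn.Linear(d_model, 2)  # learned per-token routing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Soft routing: each token receives a weighted mix of both experts.
        weights = F.softmax(self.router(x), dim=-1)              # (batch, seq, 2)
        expert_out = torch.stack([e(x) for e in self.experts])   # (2, batch, seq, d)
        return torch.einsum("bse,ebsd->bsd", weights, expert_out)

layer = TwoExpertMoE(d_model=64, d_ff=256)
print(layer(torch.randn(1, 8, 64)).shape)  # torch.Size([1, 8, 64])
```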