Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

arXiv · Significant research

Summary

MBZUAI has released Jais and Jais-chat, two open generative large language models (LLMs) focused on Arabic. Both are 13-billion-parameter models based on the GPT-3 architecture and pretrained on a mix of Arabic, English, and code. The reported evaluation shows state-of-the-art Arabic knowledge and reasoning among open models, along with competitive English performance.

Keywords

Jais · Jais-chat · MBZUAI · LLM · Arabic

Related

UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

arXiv

This paper presents a UI-level evaluation of ALLaM-34B, an Arabic-centric LLM developed by SDAIA and deployed in the HUMAIN Chat service. The evaluation used a prompt pack spanning various Arabic dialects, code-switching, reasoning, and safety, with outputs scored by frontier LLM judges. Results indicate strong performance in generation, code-switching, MSA handling, reasoning, and improved dialect fidelity, positioning ALLaM-34B as a robust Arabic LLM suitable for real-world use.

Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation

arXiv

This paper introduces Saudi-Dialect-ALLaM, a LoRA fine-tuned version of the Saudi Arabian foundation model ALLaM-7B-Instruct-preview, designed to improve the generation of Saudi dialects (Najdi and Hijazi). The model is trained on a private dataset of 5,466 synthetic instruction-response pairs, with two variants explored: Dialect-Token and No-Token training. Results indicate that the Dialect-Token model achieves superior dialect control and fidelity compared to generic instruction models, although the dataset and model weights are not released.

Arabic Mini-ClimateGPT : A Climate Change and Sustainability Tailored Arabic LLM

arXiv

Researchers introduce Arabic Mini-ClimateGPT, a tailored Arabic LLM for climate change and sustainability. The model is fine-tuned on the Clima500-Instruct dataset and uses vector embedding retrieval during inference. Evaluations show the model outperforms baseline LLMs and is preferred by experts in 81.6% of cases.

Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

arXiv

This paper introduces Palm, a culturally inclusive and linguistically diverse dataset for Arabic LLMs. It covers 22 Arab countries and features instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long, community-driven project involving 44 researchers from across the Arab world. Evaluating frontier LLMs on the dataset reveals limitations in cultural and dialectal understanding, with some countries better represented than others.