Skip to content
GCC AI Research

Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

arXiv · · Significant research

Summary

A new culturally inclusive and linguistically diverse dataset called Palm for Arabic LLMs is introduced, covering 22 Arab countries and featuring instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long community-driven project involving 44 researchers from across the Arab world. Evaluation of frontier LLMs using the dataset reveals limitations in cultural and dialectal understanding, with some countries being better represented than others.

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

arXiv ·

The paper introduces NativQA, a language-independent framework for constructing culturally and regionally aligned QA datasets in native languages. Using the framework, the authors created MultiNativQA, a multilingual natural QA dataset consisting of ~64k manually annotated QA pairs in seven languages. The dataset covers queries from native speakers from 9 regions covering 18 topics, and is designed for evaluating and tuning LLMs. Why it matters: The framework and dataset enable the creation of more culturally relevant and effective LLMs for diverse linguistic communities, including those in the Middle East.

Commonsense Reasoning in Arab Culture

arXiv ·

A new dataset called ArabCulture is introduced to address the lack of culturally relevant commonsense reasoning resources in Arabic AI. The dataset covers 13 countries across the Gulf, Levant, North Africa, and the Nile Valley, spanning 12 daily life domains with 54 fine-grained subtopics. It was built from scratch by native speakers writing and validating culturally relevant questions. Why it matters: The dataset highlights the need for more culturally aware models and benchmarks tailored to the Arabic-speaking world, moving beyond machine-translated resources.