Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

arXiv · February 28, 2025 · Significant research

Summary

The paper introduces Palm, a culturally inclusive and linguistically diverse dataset for Arabic LLMs. It covers 22 Arab countries and contains instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long, community-driven project involving 44 researchers from across the Arab world. Evaluating frontier LLMs on the dataset reveals limitations in their cultural and dialectal understanding, and shows that some countries are better represented than others.

Keywords

Arabic · LLM · dataset · cultural sensitivity · dialectal Arabic

Related

NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

arXiv · Jul 13

The paper introduces NativQA, a language-independent framework for constructing culturally and regionally aligned QA datasets in native languages. Using the framework, the authors created MultiNativQA, a multilingual natural QA dataset of ~64k manually annotated QA pairs in seven languages. The dataset contains queries from native speakers in 9 regions across 18 topics, and is designed for evaluating and tuning LLMs. Why it matters: The framework and dataset enable the creation of more culturally relevant and effective LLMs for diverse linguistic communities, including those in the Middle East.

Commonsense Reasoning in Arab Culture

arXiv · Feb 18

The paper introduces ArabCulture, a dataset that addresses the lack of culturally relevant commonsense reasoning resources in Arabic AI. The dataset covers 13 countries across the Gulf, Levant, North Africa, and the Nile Valley, spanning 12 daily life domains with 54 fine-grained subtopics. It was built from scratch by native speakers, who wrote and validated culturally relevant questions. Why it matters: The dataset highlights the need for more culturally aware models and benchmarks tailored to the Arabic-speaking world, moving beyond machine-translated resources.
