Skip to content
GCC AI Research

ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

arXiv · · Notable

Summary

The paper introduces ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset for politeness detection collected from online platforms. The dataset covers Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). It contains 10,000 samples across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). Why it matters: This dataset addresses the under-explored area of Arabic-language resources for politeness detection, which is crucial for culturally-aware NLP systems.

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

ArabJobs: A Multinational Corpus of Arabic Job Ads

arXiv ·

The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.

From Words to Proverbs: Evaluating LLMs Linguistic and Cultural Competence in Saudi Dialects with Absher

arXiv ·

This paper introduces Absher, a new benchmark for evaluating LLMs' linguistic and cultural competence in Saudi dialects. The benchmark comprises over 18,000 multiple-choice questions spanning six categories, using dialectal words, phrases, and proverbs from various regions of Saudi Arabia. Evaluation of state-of-the-art LLMs reveals performance gaps, especially in cultural inference and contextual understanding, highlighting the need for dialect-aware training.