Skip to content
GCC AI Research

Search

Results for "politeness detection"

ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

arXiv ·

The paper introduces ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset for politeness detection collected from online platforms. The dataset covers Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). It contains 10,000 samples across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). Why it matters: This dataset addresses the under-explored area of Arabic-language resources for politeness detection, which is crucial for culturally-aware NLP systems.

Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021

arXiv ·

This paper introduces two shared tasks for abusive and threatening language detection in Urdu, a low-resource language with over 170 million speakers. The tasks involve binary classification of Urdu tweets into Abusive/Non-Abusive and Threatening/Non-Threatening categories, respectively. Datasets of 2400/6000 training tweets and 1100/3950 testing tweets were created and manually annotated, along with logistic regression and BERT-based baselines. 21 teams participated and the best systems achieved F1-scores of 0.880 and 0.545 on the abusive and threatening language tasks, respectively, with m-BERT showing the best performance.

Detecting Propaganda Techniques in Code-Switched Social Media Text

arXiv ·

This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and using the right fine-tuning strategy for this task.

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models

arXiv ·

The paper introduces FanarGuard, a bilingual moderation filter for Arabic and English language models that considers both safety and cultural alignment. A dataset of 468K prompt-response pairs was created and scored by LLM judges on harmlessness and cultural awareness to train the filter. The first benchmark targeting Arabic cultural contexts was developed to evaluate cultural alignment. Why it matters: FanarGuard advances context-sensitive AI safeguards by integrating cultural awareness into content moderation, addressing a critical gap in current alignment techniques.

ALDi: Quantifying the Arabic Level of Dialectness of Text

arXiv ·

The paper introduces the concept of Arabic Level of Dialectness (ALDi), a continuous variable representing the degree of dialectal Arabic in a sentence, arguing that Arabic exists on a spectrum between MSA and DA. They present the AOC-ALDi dataset, comprising 127,835 sentences manually labeled for dialectness level, derived from news articles and user comments. Experiments show a model trained on AOC-ALDi can identify dialectness levels across various corpora and genres. Why it matters: ALDi provides a more nuanced approach to analyzing Arabic text than binary dialect identification, enabling sociolinguistic analysis of stylistic choices.

A Cross-cultural Corpus of Annotated Verbal and Nonverbal Behaviors in Receptionist Encounters

arXiv ·

Researchers created a cross-cultural corpus of annotated verbal and nonverbal behaviors in receptionist interactions. The corpus includes native speakers of American English and Arabic role-playing scenarios at university reception desks in Doha, Qatar, and Pittsburgh, USA. The manually annotated nonverbal behaviors include gaze direction, hand gestures, torso positions, and facial expressions. Why it matters: This resource can be valuable for the human-robot interaction community, especially for building culturally aware AI systems.