GCC AI Research

Results for "Dialectal Arabic"

Revisiting Common Assumptions about Arabic Dialects in NLP

arXiv

This paper critically examines common assumptions about Arabic dialects in NLP. The authors analyze a multi-label dataset in which native speakers assessed sentences in 11 country-level dialects. The analysis shows that widely held assumptions about how dialects group and differ are oversimplified and not always accurate. Why it matters: Arabic NLP tasks such as dialect identification may be limited by these inaccurate assumptions, which could hinder progress in the field.

AlcLaM: Arabic Dialectal Language Model

arXiv

The paper introduces AlcLaM, an Arabic dialectal language model built on a corpus of 3.4M sentences collected from social media. The authors use this corpus to expand the vocabulary and retrain a BERT-based model, using only 13GB of dialectal text. Despite this smaller training set, AlcLaM outperforms models such as CAMeL, MARBERT, and ArBERT on various Arabic NLP tasks. Why it matters: By focusing on dialectal Arabic, which is often underrepresented in existing models, AlcLaM achieves stronger results with less training data.

ALDi: Quantifying the Arabic Level of Dialectness of Text

arXiv

The paper introduces the Arabic Level of Dialectness (ALDi), a continuous variable representing the degree of dialectal Arabic in a sentence, and argues that Arabic exists on a spectrum between Modern Standard Arabic (MSA) and dialectal Arabic (DA). The authors present the AOC-ALDi dataset, comprising 127,835 sentences from news articles and user comments, each manually labeled for its level of dialectness. Experiments show that a model trained on AOC-ALDi can identify dialectness levels across various corpora and genres. Why it matters: ALDi provides a more nuanced approach to analyzing Arabic text than binary dialect identification, enabling sociolinguistic analysis of stylistic choices.