GCC AI Research

Results for "Arabic diacritization"

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

arXiv ·

The paper addresses the challenge of missing diacritics in Arabic NLP by exploring naturally occurring diacritics in a new dataset across six genres. It maps partially diacritized words to their full diacritization and proposes extensions to the analyze-and-disambiguate approach. The extended diacritization algorithm achieves notable improvements, and the code/datasets are released as open source. Why it matters: This research provides valuable resources and methods for improving Arabic text processing, especially in contexts where diacritization is crucial for accurate interpretation.
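As a rough illustration of what mapping a partially diacritized word to its full diacritization involves, the sketch below checks whether a partial form is compatible with a fully diacritized candidate: the base letters must match, and every diacritic present in the partial form must also appear on the same letter in the candidate. The function names and the compatibility rule are assumptions made for illustration, not the paper's algorithm.

```python
# Arabic diacritics: fathatan..sukun (U+064B-U+0652) plus dagger alif (U+0670).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(word):
    """Remove all diacritics, leaving only the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def split_letters(word):
    """Group a word into (letter, set of attached diacritics) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            letter, marks = units[-1]
            units[-1] = (letter, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def is_compatible(partial, full):
    """True if `full` keeps every letter and every diacritic found in `partial`."""
    p, f = split_letters(partial), split_letters(full)
    if len(p) != len(f):
        return False
    return all(pl == fl and pm <= fm for (pl, pm), (fl, fm) in zip(p, f))


# Hypothetical example: one partial form, two full candidates.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kaf+fatha, teh+fatha, beh+fatha
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"   # kaf+damma, teh+kasra, beh+fatha
PARTIAL = "\u0643\u062A\u064E\u0628"              # only the teh carries a fatha

print(is_compatible(PARTIAL, KATABA))  # True
print(is_compatible(PARTIAL, KUTIBA))  # False
```

A check like this only filters candidates; choosing among the compatible ones still requires a disambiguation step.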

Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset

arXiv ·

The paper introduces a new dataset for Arabic proper noun diacritization, addressing the ambiguity caused by undiacritized proper nouns in Arabic Wikipedia. The dataset includes manually diacritized Arabic proper nouns of various origins along with their English Wikipedia glosses. The authors benchmark GPT-4o on recovering the full diacritization given the undiacritized Arabic form and its English gloss, where it achieves 73% accuracy. Why it matters: The release of this dataset should facilitate further research on Arabic Wikipedia proper noun diacritization, improving the accessibility and accuracy of Arabic NLP resources.

Sadeed: Advancing Arabic Diacritization Through Small Language Model

arXiv ·

The paper introduces Sadeed, a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al.), for improved Arabic text diacritization. Sadeed is fine-tuned on high-quality diacritized datasets and achieves competitive results compared to larger proprietary models. The authors also introduce SadeedDiac-25, a new benchmark for fairer evaluation of Arabic diacritization across diverse text genres. Why it matters: This work advances Arabic NLP by providing both a competitive diacritization model and a more robust evaluation benchmark, facilitating further research and development in the field.

ALDi: Quantifying the Arabic Level of Dialectness of Text

arXiv ·

The paper introduces the concept of Arabic Level of Dialectness (ALDi), a continuous variable representing the degree of dialectal Arabic in a sentence, arguing that Arabic exists on a spectrum between Modern Standard Arabic (MSA) and Dialectal Arabic (DA). The authors present the AOC-ALDi dataset, comprising 127,835 sentences manually labeled for dialectness level, derived from news articles and user comments. Experiments show that a model trained on AOC-ALDi can identify dialectness levels across various corpora and genres. Why it matters: ALDi provides a more nuanced approach to analyzing Arabic text than binary dialect identification, enabling sociolinguistic analysis of stylistic choices.

Supporting Undotted Arabic with Pre-trained Language Models

arXiv ·

The paper examines the performance of pre-trained Arabic language models on Arabic text whose consonantal dots (the dots that distinguish otherwise identical letter shapes) have been intentionally removed to evade content classification. It proposes methods to support these "undotted" texts without retraining the models. The proposed methods achieve nearly perfect performance on one downstream task. Why it matters: The research highlights a vulnerability in Arabic NLP and offers solutions to maintain performance in the face of adversarial text manipulation.
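To make "undotted" text concrete, here is a minimal sketch of dot removal: each dotted letter is replaced by a dotless letter with the same basic shape. The mapping table and the `undot` function are simplifying assumptions for illustration (real undotting is position-dependent for some letters), not the procedure used in the paper.

```python
# Simplified, position-independent map from dotted letters to dotless shapes.
# This table is an illustrative assumption, not taken from the paper.
UNDOT_MAP = {
    "\u0628": "\u066E",  # beh  -> dotless beh
    "\u062A": "\u066E",  # teh  -> dotless beh
    "\u062B": "\u066E",  # theh -> dotless beh
    "\u062C": "\u062D",  # jeem -> hah
    "\u062E": "\u062D",  # khah -> hah
    "\u0630": "\u062F",  # thal -> dal
    "\u0632": "\u0631",  # zain -> reh
    "\u0634": "\u0633",  # sheen -> seen
    "\u0636": "\u0635",  # dad  -> sad
    "\u0638": "\u0637",  # zah  -> tah
    "\u063A": "\u0639",  # ghain -> ain
    "\u0641": "\u06A1",  # feh  -> dotless feh
    "\u0642": "\u066F",  # qaf  -> dotless qaf
    "\u0646": "\u06BA",  # noon -> noon ghunna (dotless noon)
    "\u064A": "\u0649",  # yeh  -> alef maksura (dotless yeh)
    "\u0629": "\u0647",  # teh marbuta -> heh
}
UNDOT_TABLE = str.maketrans(UNDOT_MAP)


def undot(text):
    """Replace every dotted Arabic letter with its dotless counterpart."""
    return text.translate(UNDOT_TABLE)


# beh + yeh + teh ("house") collapses to dotless beh + alef maksura + dotless beh.
print(undot("\u0628\u064A\u062A") == "\u066E\u0649\u066E")  # True
```

Because several dotted letters collapse onto one shape, the mapping is many-to-one, which is why a model that only ever saw dotted text struggles on the undotted form.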

A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

arXiv ·

The paper introduces a two-step approach for transliterating Judeo-Arabic text (written in Hebrew script) into Arabic script. The method involves character-level mapping followed by post-correction to fix grammatical and orthographic errors. The authors also benchmark LLMs on the transliteration task and demonstrate that transliteration enables the use of Arabic NLP tools on Judeo-Arabic. Why it matters: This work makes Judeo-Arabic texts more accessible to Arabic NLP, enabling processing and analysis that was previously impossible.
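The first of the two steps, character-level mapping, can be sketched as a simple lookup from Hebrew letters to Arabic letters. The table below is a simplified one-to-one assumption for illustration only; it is not the paper's mapping, and it ignores the ambiguous letters and diacritic marks that a post-correction step would have to resolve.

```python
# Simplified Hebrew-to-Arabic letter map (illustrative assumption only).
# Final forms of Hebrew letters map to the same Arabic letter as the base form.
HEBREW_TO_ARABIC = {
    "\u05D0": "\u0627",  # alef   -> alef
    "\u05D1": "\u0628",  # bet    -> beh
    "\u05D2": "\u062C",  # gimel  -> jeem
    "\u05D3": "\u062F",  # dalet  -> dal
    "\u05D4": "\u0647",  # he     -> heh
    "\u05D5": "\u0648",  # vav    -> waw
    "\u05D6": "\u0632",  # zayin  -> zain
    "\u05D7": "\u062D",  # het    -> hah
    "\u05D8": "\u0637",  # tet    -> tah
    "\u05D9": "\u064A",  # yod    -> yeh
    "\u05DA": "\u0643",  # final kaf -> kaf
    "\u05DB": "\u0643",  # kaf    -> kaf
    "\u05DC": "\u0644",  # lamed  -> lam
    "\u05DD": "\u0645",  # final mem -> meem
    "\u05DE": "\u0645",  # mem    -> meem
    "\u05DF": "\u0646",  # final nun -> noon
    "\u05E0": "\u0646",  # nun    -> noon
    "\u05E1": "\u0633",  # samekh -> seen
    "\u05E2": "\u0639",  # ayin   -> ain
    "\u05E3": "\u0641",  # final pe -> feh
    "\u05E4": "\u0641",  # pe     -> feh
    "\u05E5": "\u0635",  # final tsadi -> sad
    "\u05E6": "\u0635",  # tsadi  -> sad
    "\u05E7": "\u0642",  # qof    -> qaf
    "\u05E8": "\u0631",  # resh   -> reh
    "\u05E9": "\u0634",  # shin   -> sheen
    "\u05EA": "\u062A",  # tav    -> teh
}
TRANSLIT_TABLE = str.maketrans(HEBREW_TO_ARABIC)


def map_characters(text):
    """Map each Hebrew letter to an Arabic letter; other characters pass through."""
    return text.translate(TRANSLIT_TABLE)


# kaf + tav + alef + bet maps to kaf + teh + alef + beh ("book").
print(map_characters("\u05DB\u05EA\u05D0\u05D1") == "\u0643\u062A\u0627\u0628")  # True
```

A fixed table like this cannot decide, for example, whether a Hebrew letter stands for a dotted or undotted Arabic counterpart, which is the kind of error the post-correction step is meant to repair.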

QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

arXiv ·

The Qatar Computing Research Institute (QCRI) has released QASR, a 2,000-hour transcribed Arabic speech corpus collected from Aljazeera news broadcasts. The dataset features multi-dialect speech sampled at 16 kHz, aligned with lightly supervised transcriptions and linguistically motivated segmentation. QCRI also released a 130M-word dataset to improve language model training. Why it matters: QASR enables new research in Arabic speech recognition, dialect identification, punctuation restoration, and other NLP tasks for spoken data.