GCC AI Research

Sadeed: Advancing Arabic Diacritization Through Small Language Model

arXiv · Notable

Summary

The paper introduces Sadeed, a decoder-only language model for Arabic text diacritization, fine-tuned from the Kuwain 1.5B model (Hennara et al.). Sadeed is trained on high-quality diacritized datasets and achieves results competitive with larger proprietary models. The authors also introduce SadeedDiac-25, a new benchmark intended to enable fairer evaluation of Arabic diacritization across diverse text genres. Why it matters: This work advances Arabic NLP by providing both a competitive diacritization model and a more robust evaluation benchmark, facilitating further research and development in the field.

Related

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

arXiv

The paper addresses the challenge of missing diacritics in Arabic NLP by studying naturally occurring diacritics in a new dataset spanning six genres. It maps partially diacritized words to their full diacritization and proposes extensions to the analyze-and-disambiguate approach. The extended diacritization algorithm achieves notable improvements, and the code and datasets are released as open source. Why it matters: This research provides valuable resources and methods for improving Arabic text processing, especially in contexts where diacritization is crucial for accurate interpretation.

Supporting Undotted Arabic with Pre-trained Language Models

arXiv

The paper examines the performance of pre-trained Arabic language models on Arabic text intentionally stripped of diacritical dots to evade content classification. It proposes methods to support these "undotted" texts without retraining the models. The proposed methods achieve nearly perfect performance on one downstream task. Why it matters: The research highlights a vulnerability in Arabic NLP and offers solutions to maintain performance in the face of adversarial text manipulation.

Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset

arXiv

The paper introduces a new dataset for Arabic proper noun diacritization, addressing the ambiguity caused by undiacritized proper nouns in Arabic Wikipedia. The dataset includes manually diacritized Arabic proper nouns of various origins along with their English Wikipedia glosses. GPT-4o was benchmarked on the task of recovering full diacritization from undiacritized Arabic and English forms, achieving 73% accuracy. Why it matters: The release of this dataset should facilitate further research on Arabic Wikipedia proper noun diacritization, improving the accessibility and accuracy of Arabic NLP resources.

Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

arXiv

The paper introduces Arabic Stable LM, a 1.6B-parameter Arabic-centric language model, in both base and chat versions. The Arabic Stable LM 1.6B chat model achieves strong results on several benchmarks, outperforming models with up to 8x more parameters. The study also demonstrates the benefit of incorporating synthetic instruction-tuning data through a large synthetic dialogue dataset. Why it matters: This work makes Arabic LLMs more accessible by reducing the parameter count while maintaining strong performance, facilitating deployment in resource-constrained environments.