Skip to content
GCC AI Research

Search

Results for "Latin script"

A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

arXiv ·

The paper introduces a two-step approach for transliterating Judeo-Arabic text (written in Hebrew script) into Arabic script. The method involves character-level mapping followed by post-correction to fix grammatical and orthographic errors. The authors also benchmarked LLMs on the transliteration task and demonstrate that transliteration enables the use of Arabic NLP tools on Judeo-Arabic. Why it matters: This work makes Judeo-Arabic texts more accessible to Arabic NLP, enabling processing and analysis that was previously impossible.

Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

arXiv ·

The authors introduce Nile-Chat, a collection of LLMs (4B, 3x4B-A6B, and 12B) specifically for the Egyptian dialect, capable of understanding and generating text in both Arabic and Latin scripts. A novel language adaptation approach using the Branch-Train-MiX strategy is used to merge script-specialized experts into a single MoE model. Nile-Chat models outperform multilingual and Arabic LLMs like LLaMa, Jais, and ALLaM on newly introduced Egyptian benchmarks, with the 12B model achieving a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks; all resources are publicly available. Why it matters: This work addresses the overlooked aspect of adapting LLMs to dual-script languages, providing a methodology for creating more inclusive and representative language models in the Arabic-speaking world.

AlexU-Word: A New Dataset for Isolated-Word Closed-Vocabulary Offline Arabic Handwriting Recognition

arXiv ·

Researchers from Alexandria University introduce AlexU-Word, a new dataset for offline Arabic handwriting recognition. The dataset contains 25,114 samples of 109 unique Arabic words, covering all letter shapes, collected from 907 writers. The dataset is designed for closed-vocabulary word recognition and to support segmented letter recognition-based systems. Why it matters: This dataset can help advance Arabic handwriting recognition systems, addressing a need for high-quality Arabic datasets in NLP research.

Window-Based Descriptors for Arabic Handwritten Alphabet Recognition: A Comparative Study on a Novel Dataset

arXiv ·

This paper introduces a novel dataset for Arabic handwritten isolated alphabet letters to serve as a benchmark for future research. The study presents a comparative evaluation of window-based descriptors for Arabic handwritten alphabet recognition, testing different descriptors with various classifiers. The experiments demonstrate that window-based descriptors perform well, especially when combined with a novel spatial pyramid partitioning scheme. Why it matters: The new dataset and analysis of descriptors will help advance Arabic OCR and handwritten text recognition systems.

A Panoramic Survey of Natural Language Processing in the Arab World

arXiv ·

This survey paper reviews the landscape of Natural Language Processing (NLP) research and applications in the Arab world. It discusses the unique challenges posed by the Arabic language, such as its morphological complexity and dialectal diversity. The paper also presents a historical overview of Arabic NLP and surveys various research areas, including machine translation, sentiment analysis, and speech recognition. Why it matters: The survey provides a comprehensive resource for researchers and practitioners interested in the current state and future directions of Arabic NLP, a field critical for enabling AI technologies to serve Arabic-speaking communities.

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

arXiv ·

The paper addresses the challenge of missing diacritics in Arabic NLP by exploring naturally occurring diacritics in a new dataset across six genres. It maps partially diacritized words to their full diacritization and proposes extensions to the analyze-and-disambiguate approach. The extended diacritization algorithm achieves notable improvements, and the code/datasets are released as open source. Why it matters: This research provides valuable resources and methods for improving Arabic text processing, especially in contexts where diacritization is crucial for accurate interpretation.

Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

arXiv ·

This paper introduces a large-scale historical corpus of written Arabic spanning 1400 years. The corpus was cleaned and processed using Arabic NLP tools, including identification of reused text. The study uses a novel automatic periodization algorithm to study the history of the Arabic language, confirming the division into Modern Standard and Classical Arabic. Why it matters: This resource enables further computational research into the evolution of Arabic and the development of NLP tools for historical texts.