GCC AI Research


Results for "Arabic tokenization"

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv

The paper introduces AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm, with a normalization pipeline that handles Arabic-specific orthographic variation. In experiments, AraToken achieves 18% lower fertility (fewer tokens per word) than unnormalized baselines. The paper also introduces the Language Extension Pipeline (LEP), which integrates AraToken into Qwen3-0.6B and reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: The work provides an efficient tokenizer tailored to Arabic, which can improve LLM performance on Arabic text, and its released resources support further Arabic NLP research.
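To make the two ideas in this summary concrete, here is a minimal Python sketch of Arabic orthographic normalization and of the fertility metric (average tokens per whitespace-delimited word). The specific rules (stripping diacritics and tatweel, unifying alef variants) are common choices assumed for illustration, not the paper's actual pipeline, and the names `normalize_arabic` and `fertility` are hypothetical.

```python
import re

# Assumed normalization rules for illustration; AraToken's real pipeline may differ.
# Short vowels, tanween, shadda, sukun (U+064B..U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character
# Map alef with hamza above/below and alef with madda to bare alef (U+0627).
_ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
})


def normalize_arabic(text: str) -> str:
    """Collapse common orthographic variants so equivalent words share one form."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_ALEF_VARIANTS)


def fertility(texts, tokenize) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words if n_words else 0.0
```

Here `tokenize` is any callable that maps a string to a list of tokens, for example a wrapper around a trained tokenizer's encode method. Lower fertility means a text is covered by fewer tokens, which generally shortens sequences for the model.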

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

arXiv

This paper explores how tokenization strategies and vocabulary sizes affect Arabic language model performance on NLP tasks such as news classification and sentiment analysis. It compares four tokenizers and finds that Byte Pair Encoding (BPE) with Farasa performs best overall, which it attributes to Farasa's morphological analysis capabilities. Surprisingly, the study found that vocabulary size had limited impact on performance when model size was fixed, challenging common assumptions about the relationship between the two. Why it matters: The findings offer guidance for developing more effective and nuanced Arabic language models, particularly for handling dialectal variation and promoting responsible AI development in the region.