AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
arXiv · Significant research
Summary
The paper introduces AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a normalization pipeline that handles Arabic-specific orthographic variation. In experiments, AraToken achieves 18% lower fertility (average tokens per word) than unnormalized baselines. The authors also introduce a Language Extension Pipeline (LEP) that integrates AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: an efficient tokenizer tailored to Arabic improves LLM performance on Arabic text, and the released resources benefit Arabic NLP research.
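The paper's exact normalization steps are not detailed in this summary, but a minimal sketch of the kind of Arabic normalization such a pipeline typically performs (stripping diacritics, unifying alef variants, removing tatweel) and of the fertility metric might look like this; the function names and the specific rule set here are illustrative assumptions, not the authors' implementation:

```python
import re

# Assumed normalization rules; AraToken's actual pipeline may differ.
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tashkeel: fathatan .. sukun
TATWEEL = "\u0640"                            # elongation character

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)                         # drop short-vowel marks
    text = text.replace(TATWEEL, "")                        # drop elongation
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                 # alef maqsura -> ya
    return text

def fertility(num_tokens: int, num_words: int) -> float:
    """Fertility = average subword tokens per word; lower means a
    more efficient tokenizer for the language."""
    return num_tokens / num_words
```

For example, a word written with a hamza-above alef and a fatha normalizes to the same form as its bare-alef spelling, so the tokenizer sees one surface form instead of several, which is what drives the fertility reduction reported in the paper.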
Keywords
Arabic tokenization · SentencePiece · normalization · language extension · Qwen3