GCC AI Research


Results for "normalization"

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv ·

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility (subword tokens per word) than unnormalized baselines. The paper also introduces a Language Extension Pipeline (LEP) to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored to Arabic, improving LLM performance on Arabic text, and the released resources benefit broader Arabic NLP research.
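The summary above does not list AraToken's exact normalization rules, but Arabic normalization pipelines conventionally strip diacritics and tatweel and unify orthographic variants; the sketch below illustrates such steps, together with the fertility metric (average subword tokens per word) mentioned in the summary. All rule choices here are illustrative assumptions, not the paper's specification.

```python
import re

# Common Arabic diacritics (fathatan..sukun) plus the dagger alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # kashida/elongation character

def normalize_arabic(text: str) -> str:
    """Illustrative Arabic normalization; not AraToken's exact pipeline."""
    text = DIACRITICS.sub("", text)           # strip short-vowel marks
    text = text.replace(TATWEEL, "")          # drop elongation character
    # Unify hamzated/madda alef variants with bare alef.
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)
    return text

def fertility(num_tokens: int, num_words: int) -> float:
    """Tokenizer fertility: average subword tokens produced per word."""
    return num_tokens / num_words

# Example: normalizing a diacritized word and computing fertility.
print(normalize_arabic("\u0623\u064E\u0647\u0652\u0644\u0627\u064B"))  # -> اهلا
print(fertility(18, 10))  # -> 1.8
```

Lower fertility means the tokenizer splits words into fewer subword pieces, which shortens sequences and typically improves both throughput and downstream modeling of Arabic text.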

Moving into a new normal

KAUST ·

KAUST is gradually reopening its campus after a period of lockdown, following the Saudi government's lifting of the curfew. The reopening plan incorporates best practices learned from universities worldwide and accounts for the evolving higher-education and research landscape. KAUST has implemented comprehensive COVID-19 health and safety procedures across campus life. Why it matters: This measured reopening signals a return to normalcy for research and academic activities at KAUST while prioritizing the health and safety of its community.