GCC AI Research

Results for "Qwen3"

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv

The paper introduces AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm, with a normalization pipeline that handles Arabic-specific orthographic variation. In the reported experiments, AraToken achieves 18% lower fertility (fewer tokens per word) than unnormalized baselines. The paper also introduces a Language Extension Pipeline (LEP) for integrating AraToken into Qwen3-0.6B; it reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This work provides an efficient Arabic-specific tokenizer that can improve LLM performance on Arabic text, and its released resources support further Arabic NLP research.
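
To make the fertility metric concrete, here is a minimal sketch in Python. It assumes the common definition of fertility as the average number of tokens per whitespace-delimited word. The normalization steps shown (stripping diacritics, removing tatweel, unifying alef variants) are typical for Arabic text and are an assumption here; the summary does not specify the steps in AraToken's pipeline. The names `normalize_arabic` and `fertility` are hypothetical helpers, not part of any released code.

```python
import re

# Assumed normalization steps; the paper's actual pipeline may differ.
_DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun
_TATWEEL = "\u0640"                            # elongation character
_ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then unify alef variants."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_ALEF_VARIANTS)


def fertility(texts, tokenize) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words
```

Here `tokenize` can be any callable that maps a string to a list of tokens, so the same function can compare a tokenizer with and without normalization on the same corpus. Lower fertility means the same text is encoded in fewer tokens.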

Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing

arXiv

Researchers fine-tuned Qwen2-1.5B for Arabic with QLoRA on a system with 4GB of VRAM, training on datasets including Bactrian and Arabic Wikipedia. They addressed challenges in Arabic NLP such as morphology and dialectal variation. After 10,000 training steps, the loss converged to 0.1083, with improved handling of Arabic-specific linguistic phenomena. Why it matters: The work demonstrates a resource-efficient way to build specialized Arabic language models, which broadens access to advanced NLP technology.
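
A rough back-of-the-envelope calculation suggests why a 4GB budget is plausible. The sketch below estimates memory for the base model weights only, assuming a nominal 1.5 billion parameters and the 4-bit weight storage that QLoRA uses for the frozen base model. It ignores quantization constants, LoRA adapter weights, optimizer state, and activations, so real usage will be higher. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 1.5e9  # nominal parameter count implied by the model name (assumption)

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
```

Under these assumptions the 16-bit weights alone take most of a 4GB card, while the 4-bit weights take well under 1 GiB, leaving room for adapters, optimizer state, and activations.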