The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving performance of LLMs on Arabic text and benefiting Arabic NLP research by providing released resources.
This paper introduces AraLLaMA, a new Arabic large language model (LLM) trained using a progressive vocabulary expansion method inspired by second language acquisition. The model utilizes a modified byte-pair encoding (BPE) algorithm to dynamically extend the Arabic subwords in its vocabulary during training, balancing the out-of-vocabulary (OOV) ratio. Experiments show AraLLaMA achieves performance comparable to existing Arabic LLMs on various benchmarks, and all models, data, and code will be open-sourced. Why it matters: This work addresses the need for more accessible and performant Arabic LLMs, contributing to democratization of AI in the Arab world.
This paper explores the impact of tokenization strategies and vocabulary sizes on Arabic language model performance across NLP tasks like news classification and sentiment analysis. It compares four tokenizers, finding that Byte Pair Encoding (BPE) with Farasa performs best overall due to its morphological analysis capabilities. The study surprisingly found limited impact of vocabulary size on performance with fixed model sizes, challenging assumptions about vocabulary size and model performance. Why it matters: The findings provide insights for developing more effective and nuanced Arabic language models, particularly for handling dialectal variations and promoting responsible AI development in the region.
The paper introduces Sparse-Quantized Representation (SpQR), a new compression format and quantization technique for large language models (LLMs). SpQR identifies outlier weights and stores them in higher precision while compressing the remaining weights to 3-4 bits. The method achieves less than 1% accuracy loss in perplexity for LLaMA and Falcon LLMs and enables a 33B parameter LLM to run on a single 24GB consumer GPU. Why it matters: This enables near-lossless compression of LLMs, making powerful models accessible on resource-constrained devices and accelerating inference without significant accuracy degradation.
MBZUAI PhD graduate William de Vazelhes is researching hard-thresholding algorithms to enable AI to work from smaller datasets. His work focuses on optimization algorithms that simplify data, making it easier to analyze and work with, useful for energy-saving and deploying AI models on low-memory devices. He demonstrated that his approach can obtain results similar to those of convex algorithms in many usual settings. Why it matters: This research could broaden AI accessibility by reducing computational costs, and has potential applications in sectors like finance, particularly for portfolio management under budgetary constraints.