GCC AI Research

Results for "SentencePiece"

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm, with a normalization pipeline that handles Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility than unnormalized baselines. The paper also introduces the Language Extension Pipeline (LEP) to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic that improves LLM performance on Arabic text, and its released resources benefit Arabic NLP research.
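
The summary above does not define fertility. A minimal sketch follows, assuming the common definition of average tokens per whitespace-delimited word. The function `fertility` and the two toy tokenizers are hypothetical illustrations and are not taken from the AraToken paper or from SentencePiece.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


def two_char_tokenize(text: str) -> List[str]:
    # Toy tokenizer (hypothetical): cut every word into chunks of up to 2 characters.
    return [word[i:i + 2] for word in text.split() for i in range(0, len(word), 2)]


def whole_word_tokenize(text: str) -> List[str]:
    # Toy tokenizer (hypothetical): one token per word, so fertility is exactly 1.
    return text.split()


corpus = ["abcd ef"]
baseline = fertility(corpus, two_char_tokenize)    # 3 tokens / 2 words
improved = fertility(corpus, whole_word_tokenize)  # 2 tokens / 2 words
relative_reduction = (baseline - improved) / baseline
```

Under this definition, "18% lower fertility" would mean a `relative_reduction` of 0.18 against the baseline tokenizer, so the same text is encoded in fewer tokens.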

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

arXiv

This paper explores language-independent alternatives to morphological segmentation for Arabic NLP using data-driven sub-word units, characters as a unit of learning, and word embeddings learned using a character CNN. The study evaluates these methods on machine translation and POS tagging tasks. Results show these methods achieve performance close to or surpassing state-of-the-art approaches. Why it matters: By offering simpler, more adaptable segmentation techniques, this research can help improve Arabic NLP applications across diverse domains and dialects.

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

arXiv

This paper explores the impact of tokenization strategies and vocabulary sizes on Arabic language model performance across NLP tasks such as news classification and sentiment analysis. It compares four tokenizers and finds that Byte Pair Encoding (BPE) with Farasa performs best overall, owing to its morphological analysis capabilities. Surprisingly, the study found that vocabulary size has limited impact on performance when model size is fixed, challenging assumptions about the relationship between the two. Why it matters: The findings provide insights for developing more effective and nuanced Arabic language models, particularly for handling dialectal variations and promoting responsible AI development in the region.

AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

arXiv

The paper introduces AraModernBERT, an adaptation of the ModernBERT encoder architecture for Arabic, focusing on transtokenized embedding initialization and long-context modeling up to 8,192 tokens. Transtokenization is shown to be crucial for Arabic language modeling, significantly enhancing masked language modeling performance. The model demonstrates stable and effective long-context modeling, improving intrinsic language modeling performance at extended sequence lengths. Why it matters: This research provides practical insights for adapting modern encoder architectures to Arabic and other languages using Arabic-derived scripts, advancing Arabic NLP.

Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

arXiv

This paper introduces AraLLaMA, a new Arabic large language model (LLM) trained using a progressive vocabulary expansion method inspired by second language acquisition. The model uses a modified byte-pair encoding (BPE) algorithm to dynamically extend the Arabic subwords in its vocabulary during training, balancing the out-of-vocabulary (OOV) ratio. Experiments show AraLLaMA achieves performance comparable to existing Arabic LLMs on various benchmarks, and all models, data, and code will be open-sourced. Why it matters: This work addresses the need for more accessible and performant Arabic LLMs, contributing to the democratization of AI in the Arab world.

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

arXiv

The paper introduces Sparse-Quantized Representation (SpQR), a new compression format and quantization technique for large language models (LLMs). SpQR identifies outlier weights and stores them in higher precision while compressing the remaining weights to 3-4 bits. The method keeps the relative accuracy loss in perplexity below 1% for LLaMA and Falcon LLMs and enables a 33B-parameter LLM to run on a single 24GB consumer GPU. Why it matters: This enables near-lossless compression of LLMs, making powerful models accessible on resource-constrained devices and accelerating inference without significant accuracy degradation.
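
The general idea of keeping outliers in full precision while quantizing everything else to a few bits can be sketched as below. This is a toy illustration and not the SpQR algorithm: it assumes outliers are simply the largest-magnitude weights and uses plain min-max uniform quantization, whereas the summary does not say how SpQR selects outliers or groups weights. All function names and parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def quantize_with_outliers(
    weights: List[float], bits: int = 3, outlier_fraction: float = 0.1
) -> Tuple[List[int], float, float, Dict[int, float]]:
    """Uniformly quantize weights to `bits` bits, keeping a few outliers exact."""
    n_outliers = int(len(weights) * outlier_fraction)
    # Assumption: treat the largest-magnitude weights as outliers.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outliers = {i: weights[i] for i in by_magnitude[:n_outliers]}
    dense = [w for i, w in enumerate(weights) if i not in outliers]
    if not dense:
        raise ValueError("every weight was marked as an outlier")
    lo, hi = min(dense), max(dense)
    levels = (1 << bits) - 1  # e.g. 7 steps between 8 codes for 3 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Outlier positions get a placeholder code of 0; their exact value lives in `outliers`.
    codes = [0 if i in outliers else round((w - lo) / scale) for i, w in enumerate(weights)]
    return codes, scale, lo, outliers


def dequantize(codes: List[int], scale: float, lo: float, outliers: Dict[int, float]) -> List[float]:
    """Rebuild approximate weights; outliers are restored exactly."""
    return [outliers.get(i, lo + c * scale) for i, c in enumerate(codes)]


def max_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [0.1, -0.2, 0.05, 8.0, 0.3, -0.1, 0.0, 0.2, -0.3, 0.15]
codes, scale, lo, outliers = quantize_with_outliers(weights, bits=3, outlier_fraction=0.1)
restored = dequantize(codes, scale, lo, outliers)
err_with_outliers = max_error(weights, restored)

# Same quantizer with no outlier handling: the single large weight stretches the grid.
plain = quantize_with_outliers(weights, bits=3, outlier_fraction=0.0)
err_without_outliers = max_error(weights, dequantize(*plain))
```

In this toy case, isolating the single large weight keeps the quantization grid tight around the remaining values, so the worst-case reconstruction error is much smaller than when the outlier is quantized with everything else. As rough arithmetic for the GPU claim, 33 billion weights at 4 bits each occupy about 16.5 GB before any outlier or metadata overhead, compared with about 66 GB at 16 bits.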