GCC AI Research

Results for "BPE algorithm"

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv ·

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility (fewer tokens per word) than unnormalized baselines. The paper also introduces the Language Extension Pipeline (LEP) to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving LLM performance on Arabic text and supporting Arabic NLP research through released resources.
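Fertility is commonly measured as the average number of tokens a tokenizer produces per word. The sketch below illustrates that metric under simple assumptions: words are whitespace-delimited, and `tokenize` is a hypothetical callable standing in for any tokenizer. It is not AraToken's evaluation code.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word.

    `tokenize` is a hypothetical callable: str -> list of tokens.
    Lower values mean the tokenizer splits words into fewer pieces.
    """
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = sum(len(tokenize(text)) for text in texts)
    return n_tokens / n_words if n_words else 0.0


# Toy stand-ins for real tokenizers (assumptions for illustration only).
def char_tokenize(text):
    # One token per non-space character: the worst case for fertility.
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    # One token per word: fertility of exactly 1.0.
    return text.split()
```

A relative reduction such as the 18% reported above would then be `1 - fertility(texts, new) / fertility(texts, baseline)` on the same corpus.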

Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

arXiv ·

This paper introduces AraLLaMA, a new Arabic large language model (LLM) trained using a progressive vocabulary expansion method inspired by second language acquisition. The model utilizes a modified byte-pair encoding (BPE) algorithm to dynamically extend the Arabic subwords in its vocabulary during training, balancing the out-of-vocabulary (OOV) ratio. Experiments show AraLLaMA achieves performance comparable to existing Arabic LLMs on various benchmarks, and all models, data, and code will be open-sourced. Why it matters: This work addresses the need for more accessible and performant Arabic LLMs, contributing to democratization of AI in the Arab world.
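For readers unfamiliar with BPE, here is a minimal sketch of the standard training loop: repeatedly count adjacent symbol pairs and merge the most frequent one. This is the textbook algorithm, not AraLLaMA's modified progressive variant, and the function names are illustrative.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    # Start from individual characters as the initial symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab
```

Each learned merge adds one subword to the vocabulary, so the number of merges directly controls vocabulary size.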

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

arXiv ·

This paper explores the impact of tokenization strategies and vocabulary sizes on Arabic language model performance across NLP tasks such as news classification and sentiment analysis. It compares four tokenizers, finding that Byte Pair Encoding (BPE) with Farasa performs best overall due to its morphological analysis capabilities. Surprisingly, the study found that vocabulary size had limited impact on performance when model size was held fixed, challenging assumptions about the relationship between the two. Why it matters: The findings provide insights for developing more effective and nuanced Arabic language models, particularly for handling dialectal variations and promoting responsible AI development in the region.

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

arXiv ·

The paper introduces Sparse-Quantized Representation (SpQR), a new compression format and quantization technique for large language models (LLMs). SpQR identifies outlier weights and stores them in higher precision while compressing the remaining weights to 3-4 bits. The method keeps relative accuracy loss, measured in perplexity, below 1% for LLaMA and Falcon LLMs and enables a 33B parameter LLM to run on a single 24GB consumer GPU. Why it matters: This enables near-lossless compression of LLMs, making powerful models accessible on resource-constrained devices and accelerating inference without significant accuracy degradation.
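The core idea of keeping outliers in full precision while quantizing everything else to a few bits can be sketched as follows. This is a deliberately simplified illustration, not SpQR's actual method: the magnitude threshold for picking outliers, the single global scale, and all names here are assumptions made for the example.

```python
def quantize_with_outliers(weights, bits=3, outlier_threshold=2.0):
    """Store large-magnitude weights exactly; quantize the rest uniformly.

    Assumption: outliers are chosen by a plain magnitude threshold,
    which is a stand-in for a real outlier-detection criterion.
    """
    outliers = {i: w for i, w in enumerate(weights) if abs(w) > outlier_threshold}
    dense = [w for i, w in enumerate(weights) if i not in outliers]
    levels = 2 ** bits - 1  # e.g. 7 steps for 3-bit codes 0..7
    lo, hi = (min(dense), max(dense)) if dense else (0.0, 0.0)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Outlier positions get a placeholder code; their exact value lives in `outliers`.
    codes = [0 if i in outliers else round((w - lo) / scale)
             for i, w in enumerate(weights)]
    return codes, scale, lo, outliers


def dequantize(codes, scale, lo, outliers):
    """Rebuild weights: exact values for outliers, grid values otherwise."""
    return [outliers.get(i, lo + c * scale) for i, c in enumerate(codes)]
```

Removing the outliers before computing the scale is what keeps the quantization grid fine: a single large weight would otherwise stretch the range and increase the rounding error of every other weight.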

Developing efficient algorithms to spread the benefits of AI

MBZUAI ·

MBZUAI PhD graduate William de Vazelhes is researching hard-thresholding algorithms to enable AI to work from smaller datasets. His work focuses on optimization algorithms that simplify data, making it easier to analyze and work with, which is useful for saving energy and for deploying AI models on low-memory devices. He demonstrated that his approach can obtain results similar to those of convex algorithms in many common settings. Why it matters: This research could broaden AI accessibility by reducing computational costs, and has potential applications in sectors like finance, particularly for portfolio management under budgetary constraints.
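The basic hard-thresholding operator keeps the k largest-magnitude entries of a vector and sets the rest to zero. The sketch below shows only this generic operator as a point of reference; it is not de Vazelhes's specific algorithm, and the function name is illustrative.

```python
def hard_threshold(x, k):
    """Keep the k entries of x with the largest magnitude; zero the others."""
    if k <= 0:
        return [0.0] * len(x)
    # Indices sorted by descending absolute value; the first k are kept.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]
```

In iterative schemes, an operator like this is typically applied after each gradient step so that the solution never has more than k nonzero entries, which is what makes the resulting models small enough for low-memory devices.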

An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes

arXiv ·

This paper introduces a new non-statistical Arabic lemmatizer algorithm designed for information retrieval systems. The lemmatizer leverages Arabic language knowledge resources to generate accurate lemma forms and relevant features. The algorithm reaches a maximum accuracy of 94.8%, and achieves 89.15% on first-seen documents, compared with 76.7% for the Stanford Arabic model on the same dataset. Why it matters: Accurate Arabic lemmatization is crucial for improving the performance of Arabic information retrieval systems, which can enhance access to Arabic language content.

Award-winning algorithm aids observation

KAUST ·

KAUST researchers developed a machine learning algorithm to control a deformable mirror within the Subaru Telescope's exoplanet imaging camera, compensating for atmospheric turbulence. The algorithm, which computes a partial singular value decomposition (SVD), outperforms a standard SVD by a factor of four. The KAUST team received a best paper award at the PASC Conference for this work, which has already been deployed at the Subaru Telescope. Why it matters: This advancement enables sharper images of exoplanets, facilitating their identification and study, and showcases the impact of optimizing core linear algebra algorithms.