Skip to content
GCC AI Research

Search

Results for "transtokenization"

AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

arXiv ·

The paper introduces AraModernBERT, an adaptation of the ModernBERT encoder architecture for Arabic, focusing on transtokenized embedding initialization and long-context modeling up to 8,192 tokens. Transtokenization is shown to be crucial for Arabic language modeling, significantly enhancing masked language modeling performance. The model demonstrates stable and effective long-context modeling, improving intrinsic language modeling performance at extended sequence lengths. Why it matters: This research provides practical insights for adapting modern encoder architectures to Arabic and other languages using Arabic-derived scripts, advancing Arabic NLP.

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv ·

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving performance of LLMs on Arabic text and benefiting Arabic NLP research by providing released resources.

A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

arXiv ·

The paper introduces a two-step approach for transliterating Judeo-Arabic text (written in Hebrew script) into Arabic script. The method involves character-level mapping followed by post-correction to fix grammatical and orthographic errors. The authors also benchmarked LLMs on the transliteration task and demonstrate that transliteration enables the use of Arabic NLP tools on Judeo-Arabic. Why it matters: This work makes Judeo-Arabic texts more accessible to Arabic NLP, enabling processing and analysis that was previously impossible.

Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation

arXiv ·

This paper introduces SimulMask, a new paradigm for fine-tuning large language models (LLMs) for simultaneous translation. SimulMask utilizes a novel attention masking approach that models simultaneous translation during fine-tuning by masking attention for a desired decision policy. Applied to a Falcon LLM on the IWSLT 2017 dataset, SimulMask achieves improved translation quality compared to state-of-the-art prompting optimization strategies across five language pairs while reducing computational cost. Why it matters: The proposed method offers a more efficient way to adapt LLMs for real-time translation, potentially enhancing multilingual communication tools and services.

ArabianGPT: Native Arabic GPT-based Large Language Model

arXiv ·

The paper introduces ArabianGPT, a suite of transformer-based language models designed specifically for Arabic, including versions with 0.1B and 0.3B parameters. A key component is the AraNizer tokenizer, tailored for Arabic script's morphology. Fine-tuning ArabianGPT-0.1B achieved 95% accuracy in sentiment analysis, up from 56% in the base model, and improved F1 scores in summarization. Why it matters: The models address the gap in native Arabic LLMs, offering better performance on Arabic NLP tasks through tailored architecture and tokenization.