The paper introduces AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, focusing on transtokenized embedding initialization and long-context modeling up to 8,192 tokens. Transtokenization proves crucial for Arabic language modeling, substantially improving masked language modeling performance. The model also remains stable at long context lengths, with intrinsic language modeling quality improving as sequences extend toward the 8,192-token limit. Why it matters: This research provides practical insights for adapting modern encoder architectures to Arabic and to other languages written in Arabic-derived scripts, advancing Arabic NLP.
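To make the transtokenized-initialization idea concrete, the sketch below assigns each token in a new Arabic vocabulary an embedding derived from a pretrained source model, by re-encoding the token's surface form with the source tokenizer and averaging the matched source embeddings. This is a minimal sketch of one common heuristic; the function name, the global-mean fallback, and the exact mapping are assumptions, not AraModernBERT's actual procedure.

```python
import torch

def transtokenize_embeddings(src_model, src_tokenizer, new_tokenizer):
    """Initialize embeddings for a new Arabic vocabulary from a pretrained
    source model by re-encoding each new token with the source tokenizer and
    averaging the corresponding source embeddings. Illustrative heuristic
    only; the paper's mapping may be more elaborate."""
    src_emb = src_model.get_input_embeddings().weight.detach()   # (V_src, d)
    new_vocab = new_tokenizer.get_vocab()                        # token -> id
    new_emb = torch.empty(len(new_vocab), src_emb.size(1))

    for token, new_id in new_vocab.items():
        # Recover the token's surface form and encode it with the source tokenizer.
        surface = new_tokenizer.convert_tokens_to_string([token])
        src_ids = src_tokenizer(surface, add_special_tokens=False)["input_ids"]
        if src_ids:
            new_emb[new_id] = src_emb[src_ids].mean(dim=0)
        else:
            new_emb[new_id] = src_emb.mean(dim=0)                # fallback: global mean
    return new_emb
```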
The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm with a normalization pipeline that handles Arabic-specific orthographic variation. Experiments show that AraToken achieves 18% lower fertility (subword tokens per word) than unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored to Arabic, improving how LLMs handle Arabic text, and releases its resources for the Arabic NLP community.
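The sketch below illustrates the two ingredients named above: normalizing Arabic text before training a SentencePiece Unigram model, and measuring fertility as subword tokens per whitespace-delimited word. The specific normalization rules, file paths, and vocabulary size are placeholders; AraToken's actual pipeline and settings may differ.

```python
import re
import sentencepiece as spm

# Illustrative Arabic normalization; AraToken's exact pipeline may differ.
DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")   # tanween, harakat, dagger alef
TATWEEL = "\u0640"

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)                          # strip diacritics
    text = text.replace(TATWEEL, "")                         # drop elongation character
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)    # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                  # alef maqsura -> yaa
    return text

# Train a SentencePiece Unigram model on normalized text (paths/sizes are placeholders).
spm.SentencePieceTrainer.train(
    input="arabic_normalized.txt", model_prefix="aratoken",
    model_type="unigram", vocab_size=32000, character_coverage=0.9995,
)
sp = spm.SentencePieceProcessor(model_file="aratoken.model")

def fertility(sentences):
    """Average number of subword tokens per whitespace-delimited word."""
    toks = sum(len(sp.encode(normalize(s))) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return toks / max(words, 1)
```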
The paper introduces a two-step approach for transliterating Judeo-Arabic text (Arabic written in Hebrew script) into Arabic script: character-level mapping followed by post-correction of grammatical and orthographic errors. The authors also benchmark LLMs on the transliteration task and demonstrate that transliteration makes Arabic NLP tools applicable to Judeo-Arabic. Why it matters: This work makes Judeo-Arabic texts accessible to Arabic NLP, enabling processing and analysis that were previously impossible.
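A minimal sketch of the first step (character-level mapping) is shown below. The Hebrew-to-Arabic correspondences listed are a common illustrative subset, not the paper's full table, and the second step (post-correction of grammatical and orthographic errors) is omitted.

```python
# Illustrative subset of Hebrew -> Arabic character correspondences for
# Judeo-Arabic; the paper's full mapping and its post-correction step also
# handle ambiguous letters and orthographic marks that this sketch does not.
HEB2ARA = {
    "א": "ا", "ב": "ب", "ג": "ج", "ד": "د", "ה": "ه", "ו": "و",
    "ז": "ز", "ח": "ح", "ט": "ط", "י": "ي", "כ": "ك", "ך": "ك",
    "ל": "ل", "מ": "م", "ם": "م", "נ": "ن", "ן": "ن", "ס": "س",
    "ע": "ع", "פ": "ف", "ף": "ف", "צ": "ص", "ץ": "ص", "ק": "ق",
    "ר": "ر", "ש": "ش", "ת": "ت",
}

def transliterate(text: str) -> str:
    """Step 1: deterministic character-level mapping (step 2, model-based
    post-correction, is not shown here)."""
    return "".join(HEB2ARA.get(ch, ch) for ch in text)
```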
This paper introduces SimulMask, a new paradigm for fine-tuning large language models (LLMs) for simultaneous translation. SimulMask encodes a desired decision policy directly in the attention mask, so the model experiences simultaneous-translation conditions during fine-tuning. Applied to a Falcon LLM on the IWSLT 2017 dataset, SimulMask achieves better translation quality than state-of-the-art prompting optimization strategies across five language pairs while reducing computational cost. Why it matters: The proposed method offers a more efficient way to adapt LLMs for real-time translation, potentially enhancing multilingual communication tools and services.
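To make the masking idea concrete, the sketch below builds a boolean attention mask over a concatenated source-target sequence for a wait-k decision policy: each target position may attend only to the source prefix read so far plus the causal target prefix. This is a simplified illustration under assumed conventions, not SimulMask's exact construction from the paper.

```python
import torch

def simulmask_style_mask(src_len: int, tgt_len: int, k: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over a concatenated
    [source; target] sequence under a wait-k policy: the i-th target token
    sees the first min(i + k, src_len) source tokens plus the causal target
    prefix. A sketch of the idea, not the paper's exact construction."""
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Source positions attend causally among themselves (decoder-only LLM).
    mask[:src_len, :src_len] = torch.tril(torch.ones(src_len, src_len)).bool()

    for i in range(tgt_len):
        row = src_len + i
        visible_src = min(i + k, src_len)      # amount of source read so far
        mask[row, :visible_src] = True         # attend to available source only
        mask[row, src_len:row + 1] = True      # causal attention over target prefix
    return mask

# Example: 6 source tokens, 4 target tokens, wait-3 policy.
print(simulmask_style_mask(6, 4, 3).int())
```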
The paper introduces ArabianGPT, a suite of transformer-based language models designed specifically for Arabic, with 0.1B- and 0.3B-parameter versions. A key component is the AraNizer tokenizer, tailored to Arabic script and morphology. Fine-tuning ArabianGPT-0.1B raised sentiment analysis accuracy from 56% with the base model to 95%, and improved F1 scores on summarization. Why it matters: The models address the scarcity of native Arabic LLMs, offering better performance on Arabic NLP tasks through tailored architecture and tokenization.
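A sentiment fine-tuning run of the kind reported above might look like the following Hugging Face sketch. The model identifier, dataset files, and hyperparameters are placeholders rather than the paper's actual setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder identifiers: the actual hub name of ArabianGPT-0.1B and the
# sentiment dataset used in the paper may differ from what is shown here.
MODEL_NAME = "path/to/arabiangpt-0.1b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token          # GPT-style models often lack a pad token
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Assumes CSV files with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arabiangpt-sentiment",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,                           # enables dynamic padding
)
trainer.train()
```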