AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

arXiv · February 10, 2026 · Notable

Summary

The paper introduces AraModernBERT, an adaptation of the ModernBERT encoder architecture for Arabic, focusing on transtokenized embedding initialization and long-context modeling up to 8,192 tokens. Transtokenization is shown to be crucial for Arabic language modeling, significantly enhancing masked language modeling performance. The model demonstrates stable and effective long-context modeling, improving intrinsic language modeling performance at extended sequence lengths. Why it matters: This research provides practical insights for adapting modern encoder architectures to Arabic and other languages using Arabic-derived scripts, advancing Arabic NLP.

Keywords

AraModernBERT · Arabic · transtokenization · long-context modeling · language modeling

Read original article →

Get the weekly digest

Top AI stories from the GCC region, every week.