Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks

arXiv · November 2, 2024 · Significant research

Summary

Researchers introduce Swan, a family of Arabic-centric embedding models comprising Swan-Small (based on ARBERTv2) and Swan-Large (based on ArMistral). They also propose ArabicMTEB, a benchmark suite that evaluates cross-lingual, multi-dialectal Arabic text embedding performance across 8 tasks and 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large in most Arabic tasks.

Why it matters: The new models and benchmark suite address the need for high-quality Arabic embedding models that are both dialectally and culturally aware, enabling more effective Arabic NLP applications.

Keywords

Arabic NLP · embedding models · cross-lingual · multi-dialectal · ArabicMTEB
