Researchers at MBZUAI have developed a new automatic method to examine cross-lingual abilities in multilingual language models, testing 10 models across 16 languages. They combined beam search with language-model-based simulation, generating 6,000 bilingual question pairs and found significant performance drops compared to English, even in high-resource languages like Chinese. The method introduces perturbations to test the models' ability to transfer knowledge rather than rely on memorization. Why it matters: This research highlights critical gaps in cross-lingual AI, providing a framework for developing more equitable and effective multilingual models, especially for Arabic and other under-represented languages.
Researchers introduce a benchmark to evaluate the factual recall and knowledge transferability of multilingual language models across 13 languages. The study reveals that language models often fail to transfer knowledge between languages, even when they possess the correct information in one language. The benchmark and evaluation framework are released to drive future research in multilingual knowledge transfer.
Project LITMUS explores predicting cross-lingual transfer accuracy in multilingual language models, even without test data in target languages. The goal is to estimate model performance in low-resource languages and optimize training data for desired cross-lingual performance. This research aims to identify factors influencing cross-lingual transfer, contributing to linguistically fair MMLMs. Why it matters: Improving cross-lingual transfer is vital for creating more equitable and effective multilingual AI systems, especially for languages with limited resources.
This paper explores cross-lingual transfer in Arabic language models, which are typically pretrained on Modern Standard Arabic (MSA) but expected to generalize to diverse dialects. The study uses probing on 3 NLP tasks and representational similarity analysis to assess transfer effectiveness. Results show transfer is uneven across dialects, partially linked to geographic proximity, and models trained on all dialects exhibit negative interference. Why it matters: The findings highlight challenges in cross-lingual transfer for Arabic NLP and raise questions about dialect similarity for model training.
This paper introduces a novel evaluation framework for Arabic language models, addressing gaps in linguistic accuracy and cultural alignment. The authors analyze existing datasets and present the Arabic Depth Mini Dataset (ADMD), a curated collection of 490 questions across ten domains. Evaluating GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max using ADMD reveals performance variations, with Claude 3.5 Sonnet achieving the highest accuracy at 30%. Why it matters: The work emphasizes the importance of cultural competence in Arabic language model evaluation, providing practical insights for improvement.