MBZUAI is conducting research to improve cross-cultural understanding using AI, including studying the limitations of LLMs in recognizing cultural references. Its researchers developed "Culturally Yours," a tool that helps users comprehend cultural references in text, and the "All Languages Matter Benchmark" (ALM Bench) to evaluate multimodal LLMs across 100 languages. MBZUAI has also developed LLMs tailored to low-resource languages, including Jais (Arabic), Nanda (Hindi), and Sherkala (Kazakh). Why it matters: These initiatives promote inclusivity and aim to make AI systems culturally aware and able to serve diverse populations effectively, particularly in the Middle East's multicultural context.
MBZUAI researchers, working with more than 70 collaborators, have created the Culturally diverse Visual Question Answering (CVQA) benchmark to evaluate cultural understanding in multimodal LLMs. The CVQA dataset includes over 10,000 questions in 31 languages and 13 scripts, testing models on images of local dishes, personalities, and monuments. Tests of several multimodal LLMs on the benchmark showed that even top models struggle with these questions. Why it matters: This benchmark highlights the need for AI models to better understand diverse cultures, promoting fairness and relevance across different languages and regions.
MBZUAI researchers presented a method for cross-cultural transfer learning to improve language models' understanding of diverse Arab cultures. They used in-context learning and demonstration-based reinforcement (DITTO) to transfer cultural knowledge between countries. Experiments showed up to 34% improvement in performance on cultural understanding benchmarks using only a few demonstrations. Why it matters: This research addresses the gap in cultural understanding of Arabic language models, especially for smaller Arab countries, and provides a novel transfer learning approach.
Researchers from MBZUAI, the University of Washington, and other institutions presented studies at EMNLP 2024 exploring how LLMs represent cultures. A survey analyzed dozens of recent studies on LLMs and culture and proposed a new framework for future research. The survey found that there is no widely accepted definition of 'culture' in NLP, which makes it difficult to interpret how models represent culture through language. Why it matters: This highlights a key gap in the field and emphasizes the need for a more rigorous and consistent understanding of culture in AI, especially as LLMs become more globally integrated.
The paper introduces SaudiCulture, a new benchmark for evaluating the cultural competence of LLMs within Saudi Arabia, covering five major geographical regions and diverse cultural domains. The benchmark includes questions of varying complexity and distinguishes between common and specialized regional knowledge. Evaluations of five LLMs (GPT-4, Llama 3.3, FANAR, Jais, and AceGPT) revealed performance declines on region-specific questions, highlighting the need for region-specific knowledge in LLM training.
The paper introduces FanarGuard, a bilingual moderation filter for Arabic and English language models that considers both safety and cultural alignment. To train the filter, the authors created a dataset of 468K prompt-response pairs, each scored by LLM judges on harmlessness and cultural awareness. To evaluate cultural alignment, they also developed the first benchmark targeting Arabic cultural contexts. Why it matters: FanarGuard advances context-sensitive AI safeguards by integrating cultural awareness into content moderation, addressing a critical gap in current alignment techniques.
MBZUAI researchers presented two studies at NAACL 2025 on how LLMs understand cultural differences, one of which won the SAC award. One study, titled "Reading between the lines: Can LLMs identify cross-cultural communication gaps?", assesses GPT-4o's ability to identify cultural references in Goodreads book reviews. The researchers created a benchmark dataset using annotations from 50 evaluators across different cultures to measure the LLM's ability to identify culture-specific items (CSIs). Why it matters: Improving LLMs' cross-cultural understanding is crucial for ensuring these models can be used effectively and equitably across diverse global contexts.
A new paper from MBZUAI introduces JEEM, a benchmark dataset for evaluating vision-language models on their understanding of images grounded in four Arabic-speaking societies (Jordan, UAE, Egypt, and Morocco) and on their ability to use local dialects. The dataset comprises 2,178 images and 10,890 question-answer pairs reflecting everyday life and culturally specific scenes. Evaluation of several Arabic-capable models (Maya, PALO, Peacock, AIN, AyaV) and GPT-4o revealed that while the models can generate fluent language, they struggle with genuine understanding, consistency, and relevance, especially when cultural context is important. Why it matters: This research highlights the challenges of building AI systems that can truly understand and interact with diverse cultures, emphasizing the need for culturally grounded datasets and evaluation metrics.